WEBVTT

00:00:07.961 --> 00:00:11.075
- Okay. Can everyone hear me?
Okay.

00:00:11.075 --> 00:00:12.153
Sorry for the delay.

00:00:12.153 --> 00:00:13.563
I had a bit of technical difficulty.

00:00:13.563 --> 00:00:15.891
Today was the first time
I was trying to use my

00:00:15.891 --> 00:00:17.843
new touch bar Mac book pro for presenting,

00:00:17.843 --> 00:00:19.433
and none of the adapters are working.

00:00:19.433 --> 00:00:21.814
So, I had to switch
laptops at the last minute.

00:00:21.814 --> 00:00:22.731
So, thanks.

00:00:24.003 --> 00:00:25.353
Sorry about that.

00:00:25.353 --> 00:00:27.451
So, today is lecture 10.

00:00:27.451 --> 00:00:30.420
We're talking about
recurrent neural networks.

00:00:30.420 --> 00:00:33.003
So, as of, as usual, a
couple administrative notes.

00:00:33.003 --> 00:00:34.835
So, We're working hard

00:00:35.891 --> 00:00:37.353
on assignment one grading.

00:00:37.353 --> 00:00:38.913
Those grades will probably be out

00:00:38.913 --> 00:00:40.921
sometime later today.

00:00:40.921 --> 00:00:42.531
Hopefully, they can get out

00:00:42.531 --> 00:00:44.382
before the A2 deadline.

00:00:44.382 --> 00:00:46.251
That's what I'm hoping for.

00:00:46.251 --> 00:00:50.361
On a related note, Assignment
two is due today at 11:59 p.m.

00:00:50.361 --> 00:00:53.111
so, who's done with that already?

00:00:55.280 --> 00:00:56.633
About half you guys.

00:00:56.633 --> 00:00:58.273
So, you remember, I did warn you

00:00:58.273 --> 00:00:59.251
when the assignment went out

00:00:59.251 --> 00:01:01.091
that it was quite long, to start early.

00:01:01.091 --> 00:01:03.811
So, you were warned about that.

00:01:03.811 --> 00:01:06.561
But, hopefully, you guys
have some late days left.

00:01:06.561 --> 00:01:08.160
Also, as another reminder,

00:01:08.160 --> 00:01:10.531
the midterm will be in class on Tuesday.

00:01:10.531 --> 00:01:12.321
If you kind of look
around the lecture hall,

00:01:12.321 --> 00:01:13.782
there are not enough seats in this room

00:01:13.782 --> 00:01:15.881
to seat all the enrolled
students in the class.

00:01:15.881 --> 00:01:18.113
So, we'll actually be having the midterm

00:01:18.113 --> 00:01:20.062
in several other lecture
halls across campus.

00:01:20.062 --> 00:01:21.702
And we'll be sending out some more details

00:01:21.702 --> 00:01:26.099
on exactly where to go in
the next couple of days.

00:01:26.099 --> 00:01:28.179
So a bit of a, another
bit of announcement.

00:01:28.179 --> 00:01:29.161
We've been working on this sort of

00:01:29.161 --> 00:01:31.169
fun bit of extra credit
thing for you to play with

00:01:31.169 --> 00:01:33.249
that we're calling the training game.

00:01:33.249 --> 00:01:34.950
This is this cool
browser based experience,

00:01:34.950 --> 00:01:36.159
where you can go in

00:01:36.159 --> 00:01:37.480
and interactively train neural networks

00:01:37.480 --> 00:01:39.927
and tweak the hyper
parameters during training.

00:01:39.927 --> 00:01:42.289
And this should be a
really cool interactive way

00:01:42.289 --> 00:01:43.219
for you to practice

00:01:43.219 --> 00:01:45.395
some of these hyper
parameter tuning skills

00:01:45.395 --> 00:01:47.646
that we've been talking about
the last couple of lectures.

00:01:47.646 --> 00:01:48.681
So this is not required,

00:01:48.681 --> 00:01:51.629
but this, I think, will be
a really useful experience

00:01:51.629 --> 00:01:53.190
to gain a little bit more intuition

00:01:53.190 --> 00:01:54.961
into how some of these
hyper parameters work

00:01:54.961 --> 00:01:57.481
for different types of
data sets in practice.

00:01:57.481 --> 00:01:59.339
So we're still working on getting

00:01:59.339 --> 00:02:00.950
all the bugs worked out of this setup,

00:02:00.950 --> 00:02:02.319
and we'll probably send out

00:02:02.319 --> 00:02:03.459
some more instructions

00:02:03.459 --> 00:02:04.499
on exactly how this will work

00:02:04.499 --> 00:02:05.790
in the next couple of days.

00:02:05.790 --> 00:02:07.209
But again, not required.

00:02:07.209 --> 00:02:08.280
But please do check it out.

00:02:08.280 --> 00:02:09.161
I think it'll be really fun

00:02:09.161 --> 00:02:11.008
and a really cool thing
for you to play with.

00:02:11.008 --> 00:02:12.529
And will give you a bit of extra credit

00:02:12.529 --> 00:02:13.779
if you do some,

00:02:14.783 --> 00:02:15.897
if you end up working with this

00:02:15.897 --> 00:02:18.208
and doing a couple of runs with it.

00:02:18.208 --> 00:02:20.041
So, we'll again send out
some more details about this

00:02:20.041 --> 00:02:23.458
soon once we get all the bugs worked out.

00:02:24.720 --> 00:02:25.819
As a reminder,

00:02:25.819 --> 00:02:28.139
last time we were talking
about CNN Architectures.

00:02:28.139 --> 00:02:29.990
We kind of walked through the time line

00:02:29.990 --> 00:02:31.150
of some of the various winners

00:02:31.150 --> 00:02:33.081
of the image net classification challenge,

00:02:33.081 --> 00:02:35.006
kind of the breakthrough result.

00:02:35.006 --> 00:02:37.659
As we saw was the AlexNet
architecture in 2012,

00:02:37.659 --> 00:02:39.979
which was a nine layer
convolutional network.

00:02:39.979 --> 00:02:41.331
It did amazingly well,

00:02:41.331 --> 00:02:42.401
and it sort of kick started

00:02:42.401 --> 00:02:44.921
this whole deep learning
revolution in computer vision,

00:02:44.921 --> 00:02:46.361
and kind of brought a lot of these models

00:02:46.361 --> 00:02:48.081
into the mainstream.

00:02:48.081 --> 00:02:50.038
Then we skipped ahead a couple years,

00:02:50.038 --> 00:02:52.470
and saw that in 2014 image net challenge,

00:02:52.470 --> 00:02:54.278
we had these two really
interesting models,

00:02:54.278 --> 00:02:55.111
VGG and GoogLeNet,

00:02:55.111 --> 00:02:56.699
which were much deeper.

00:02:56.699 --> 00:02:57.532
So VGG was,

00:02:57.532 --> 00:02:59.870
they had a 16 and a 19 layer model,

00:02:59.870 --> 00:03:01.150
and GoogLeNet was, I believe,

00:03:01.150 --> 00:03:02.930
a 22 layer model.

00:03:02.930 --> 00:03:04.750
Although one thing that
is kind of interesting

00:03:04.750 --> 00:03:05.811
about these models

00:03:05.811 --> 00:03:08.121
is that the 2014 image net challenge

00:03:08.121 --> 00:03:11.230
was right before batch
normalization was invented.

00:03:11.230 --> 00:03:12.411
So at this time,

00:03:12.411 --> 00:03:13.979
before the invention
of batch normalization,

00:03:13.979 --> 00:03:15.870
training these relatively deep models

00:03:15.870 --> 00:03:18.761
of roughly twenty layers
was very challenging.

00:03:18.761 --> 00:03:20.510
So, in fact, both of these two models

00:03:20.510 --> 00:03:22.310
had to resort to a little bit of hackery

00:03:22.310 --> 00:03:24.869
in order to get their
deep models to converge.

00:03:24.869 --> 00:03:28.579
So for VGG, they had the
16 and 19 layer models,

00:03:28.579 --> 00:03:31.870
but actually they first
trained an 11 layer model,

00:03:31.870 --> 00:03:34.107
because that was what they
could get to converge.

00:03:34.107 --> 00:03:36.600
And then added some extra
random layers in the middle

00:03:36.600 --> 00:03:37.771
and then continued training,

00:03:37.771 --> 00:03:40.059
actually training the
16 and 19 layer models.

00:03:40.059 --> 00:03:42.361
So, managing this training process

00:03:42.361 --> 00:03:44.058
was very challenging in 2014

00:03:44.058 --> 00:03:46.539
before the invention
of batch normalization.

00:03:46.539 --> 00:03:47.801
Similarly, for GoogLeNet,

00:03:47.801 --> 00:03:50.251
we saw that GoogLeNet has
these auxiliary classifiers

00:03:50.251 --> 00:03:52.539
that were stuck into lower
layers of the network.

00:03:52.539 --> 00:03:54.959
And these were not really
needed for the class to,

00:03:54.959 --> 00:03:56.539
to get good classification performance.

00:03:56.539 --> 00:03:59.390
This was just sort of a way to cause

00:03:59.390 --> 00:04:01.201
extra gradient to be injected

00:04:01.201 --> 00:04:03.430
directly into the lower
layers of the network.

00:04:03.430 --> 00:04:05.315
And this sort of,

00:04:05.315 --> 00:04:08.110
this again was before the
invention of batch normalization

00:04:08.110 --> 00:04:09.430
and now once you have these networks

00:04:09.430 --> 00:04:10.411
with batch normalization,

00:04:10.411 --> 00:04:13.321
then you no longer need
these slightly ugly hacks

00:04:13.321 --> 00:04:17.321
in order to get these
deeper models to converge.

00:04:17.321 --> 00:04:20.801
Then we also saw in the
2015 image net challenge

00:04:20.801 --> 00:04:22.979
was this really cool model called ResNet,

00:04:22.979 --> 00:04:24.350
these residual networks

00:04:24.350 --> 00:04:26.321
that now have these shortcut connections

00:04:26.321 --> 00:04:28.310
that actually have these
little residual blocks

00:04:28.310 --> 00:04:30.721
where we're going to take our input,

00:04:30.721 --> 00:04:32.280
pass it through the residual blocks,

00:04:32.280 --> 00:04:33.989
and then add that output of the,

00:04:33.989 --> 00:04:36.569
then add our input to the block,

00:04:36.569 --> 00:04:39.110
to the output from these
convolutional layers.

00:04:39.110 --> 00:04:40.790
This is kind of a funny architecture,

00:04:40.790 --> 00:04:43.308
but it actually has two
really nice properties.

00:04:43.308 --> 00:04:45.691
One is that if we just set all the weights

00:04:45.691 --> 00:04:47.481
in this residual block to zero,

00:04:47.481 --> 00:04:49.531
then this block is competing the identity.

00:04:49.531 --> 00:04:50.451
So in some way,

00:04:50.451 --> 00:04:52.881
it's relatively easy for this model

00:04:52.881 --> 00:04:55.681
to learn not to use the
layers that it doesn't need.

00:04:55.681 --> 00:04:57.179
In addition, it kind of adds this

00:04:57.179 --> 00:05:00.110
interpretation to L2 regularization

00:05:00.110 --> 00:05:02.171
in the context of these neural networks,

00:05:02.171 --> 00:05:03.771
cause once you put L2 regularization,

00:05:03.771 --> 00:05:04.881
remember, on your,

00:05:04.881 --> 00:05:06.059
on the weights of your network,

00:05:06.059 --> 00:05:08.321
that's going to drive all
the parameters towards zero.

00:05:08.321 --> 00:05:10.211
And maybe your standard
convolutional architecture

00:05:10.211 --> 00:05:11.379
is driving towards zero.

00:05:11.379 --> 00:05:12.739
Maybe it doesn't make sense.

00:05:12.739 --> 00:05:14.481
But in the context of a residual network,

00:05:14.481 --> 00:05:16.561
if you drive all the
parameters towards zero,

00:05:16.561 --> 00:05:18.051
that's kind of encouraging the model

00:05:18.051 --> 00:05:20.510
to not use layers that it doesn't need,

00:05:20.510 --> 00:05:21.891
because it will just drive those,

00:05:21.891 --> 00:05:23.579
the residual blocks towards the identity,

00:05:23.579 --> 00:05:26.310
whether or not needed for classification.

00:05:26.310 --> 00:05:28.538
The other really useful property
of these residual networks

00:05:28.538 --> 00:05:31.371
has to do with the gradient
flow in the backward paths.

00:05:31.371 --> 00:05:32.971
If you remember what happens
at these addition gates

00:05:32.971 --> 00:05:34.361
in the backward pass,

00:05:34.361 --> 00:05:35.451
when upstream gradient is coming in

00:05:35.451 --> 00:05:37.110
through an addition gate,

00:05:37.110 --> 00:05:37.943
then it will split

00:05:37.943 --> 00:05:39.881
and fork along these two different paths.

00:05:39.881 --> 00:05:42.750
So then, when upstream gradient comes in,

00:05:42.750 --> 00:05:46.361
it'll take one path through
these convolutional blocks,

00:05:46.361 --> 00:05:48.611
but it will also have a direct
connection of the gradient

00:05:48.611 --> 00:05:50.811
through this residual connection.

00:05:50.811 --> 00:05:51.920
So then when you look at,

00:05:51.920 --> 00:05:53.211
when you imagine stacking many of these

00:05:53.211 --> 00:05:55.630
residual blocks on top of each other,

00:05:55.630 --> 00:05:57.091
and our network ends up with hundreds of,

00:05:57.091 --> 00:05:59.150
potentially hundreds of layers.

00:05:59.150 --> 00:06:00.578
Then, these residual connections

00:06:00.578 --> 00:06:02.841
give a sort of gradient super highway

00:06:02.841 --> 00:06:04.361
for gradients to flow backward

00:06:04.361 --> 00:06:05.561
through the entire network.

00:06:05.561 --> 00:06:08.299
And this allows it to train much easier

00:06:08.299 --> 00:06:09.630
and much faster.

00:06:09.630 --> 00:06:11.499
And actually allows
these things to converge

00:06:11.499 --> 00:06:12.459
reasonably well,

00:06:12.459 --> 00:06:15.738
even when the model is potentially
hundreds of layers deep.

00:06:15.738 --> 00:06:19.122
And this idea of managing
gradient flow in your models

00:06:19.122 --> 00:06:21.550
is actually super important
everywhere in machine learning.

00:06:21.550 --> 00:06:23.880
And super prevalent in
recurrent networks as well.

00:06:23.880 --> 00:06:26.481
So we'll definitely revisit
this idea of gradient flow

00:06:26.481 --> 00:06:28.564
later in today's lecture.

00:06:31.148 --> 00:06:34.377
So then, we kind of also saw
a couple other more exotic,

00:06:34.377 --> 00:06:36.131
more recent CNN architectures last time,

00:06:36.131 --> 00:06:38.068
including DenseNet and FractalNet,

00:06:38.068 --> 00:06:39.771
and once you think about
these architectures

00:06:39.771 --> 00:06:41.321
in terms of gradient flow,

00:06:41.321 --> 00:06:43.070
they make a little bit more sense.

00:06:43.070 --> 00:06:45.019
These things like DenseNet and FractalNet

00:06:45.019 --> 00:06:46.761
are adding these additional shortcut

00:06:46.761 --> 00:06:48.619
or identity connections inside the model.

00:06:48.619 --> 00:06:50.241
And if you think about what happens

00:06:50.241 --> 00:06:52.011
in the backwards pass for these models,

00:06:52.011 --> 00:06:53.411
these additional funny topologies

00:06:53.411 --> 00:06:55.251
are basically providing direct paths

00:06:55.251 --> 00:06:56.550
for gradients to flow

00:06:56.550 --> 00:06:58.139
from the loss at the end of the network

00:06:58.139 --> 00:07:00.571
more easily into all the
different layers of the network.

00:07:00.571 --> 00:07:01.870
So I think that,

00:07:01.870 --> 00:07:04.491
again, this idea of managing
gradient flow properly

00:07:04.491 --> 00:07:06.579
in your CNN Architectures

00:07:06.579 --> 00:07:07.771
is something that we've really seen

00:07:07.771 --> 00:07:09.760
a lot more in the last couple of years.

00:07:09.760 --> 00:07:11.721
And will probably see more moving forward

00:07:11.721 --> 00:07:15.221
as more exotic architectures are invented.

00:07:16.257 --> 00:07:18.189
We also saw this kind of nice plot,

00:07:18.189 --> 00:07:19.939
plotting performance of

00:07:19.939 --> 00:07:22.081
the number of flops versus
the number of parameters

00:07:22.081 --> 00:07:24.331
versus the run time of
these various models.

00:07:24.331 --> 00:07:25.790
And there's some
interesting characteristics

00:07:25.790 --> 00:07:27.971
that you can dive in
and see from this plot.

00:07:27.971 --> 00:07:31.110
One idea is that VGG and AlexNet

00:07:31.110 --> 00:07:32.801
have a huge number of parameters,

00:07:32.801 --> 00:07:34.291
and these parameters actually come

00:07:34.291 --> 00:07:35.961
almost entirely from the
fully connected layers

00:07:35.961 --> 00:07:37.119
of the models.

00:07:37.119 --> 00:07:39.959
So AlexNet has something like
roughly 62 million parameters,

00:07:39.959 --> 00:07:42.470
and if you look at that
last fully connected layer,

00:07:42.470 --> 00:07:44.761
the final fully connected layer in AlexNet

00:07:44.761 --> 00:07:47.771
is going from an activation
volume of six by six by 256

00:07:47.771 --> 00:07:51.190
into this fully connected vector of 496.

00:07:51.190 --> 00:07:53.310
So if you imagine what the weight matrix

00:07:53.310 --> 00:07:54.851
needs to look like at that layer,

00:07:54.851 --> 00:07:56.851
the weight matrix is gigantic.

00:07:56.851 --> 00:07:58.630
It's number of entries is six by six,

00:07:58.630 --> 00:08:01.921
six times six times 256 times 496.

00:08:01.921 --> 00:08:03.390
And if you multiply that out,

00:08:03.390 --> 00:08:04.491
you see that that single layer

00:08:04.491 --> 00:08:06.370
has 38 million parameters.

00:08:06.370 --> 00:08:07.977
So more than half of the parameters

00:08:07.977 --> 00:08:09.219
of the entire AlexNet model

00:08:09.219 --> 00:08:11.859
are just sitting in that
last fully connected layer.

00:08:11.859 --> 00:08:13.339
And if you add up all the parameters

00:08:13.339 --> 00:08:15.971
in just the fully connected
layers of AlexNet,

00:08:15.971 --> 00:08:17.739
including these other
fully connected layers,

00:08:17.739 --> 00:08:20.721
you see something like
59 of the 62 million

00:08:20.721 --> 00:08:21.841
parameters in AlexNet

00:08:21.841 --> 00:08:24.241
are sitting in these
fully connected layers.

00:08:24.241 --> 00:08:26.439
So then when we move other architectures,

00:08:26.439 --> 00:08:27.601
like GoogLeNet and ResNet,

00:08:27.601 --> 00:08:29.739
they do away with a lot of these large

00:08:29.739 --> 00:08:31.110
fully connected layers

00:08:31.110 --> 00:08:32.611
in favor of global average pooling

00:08:32.611 --> 00:08:33.698
at the end of the network.

00:08:33.698 --> 00:08:35.200
And this allows these
networks to really cut,

00:08:35.201 --> 00:08:36.971
these nicer architectures,

00:08:36.971 --> 00:08:39.019
to really cut down the parameter count

00:08:39.019 --> 00:08:40.936
in these architectures.

00:08:44.463 --> 00:08:46.641
So that was kind of our brief recap

00:08:46.641 --> 00:08:48.771
of the CNN architectures
that we saw last lecture,

00:08:48.771 --> 00:08:49.604
and then today,

00:08:49.604 --> 00:08:50.462
we're going to move to

00:08:50.462 --> 00:08:52.891
one of my favorite topics to talk about,

00:08:52.891 --> 00:08:56.321
which is recurrent neural networks.

00:08:56.321 --> 00:08:57.502
So, so far in this class,

00:08:57.502 --> 00:08:59.342
we've seen, what I like to think of

00:08:59.342 --> 00:09:01.759
as kind of a vanilla feed forward network,

00:09:01.759 --> 00:09:03.222
all of our network
architectures have this flavor,

00:09:03.222 --> 00:09:05.160
where we receive some input

00:09:05.160 --> 00:09:07.353
and that input is a fixed size object,

00:09:07.353 --> 00:09:08.593
like an image or vector.

00:09:08.593 --> 00:09:11.691
That input is fed through
some set of hidden layers

00:09:11.691 --> 00:09:13.850
and produces a single output,

00:09:13.850 --> 00:09:14.851
like a classifications,

00:09:14.851 --> 00:09:16.793
like a set of classifications scores

00:09:16.793 --> 00:09:18.876
over a set of categories.

00:09:20.071 --> 00:09:22.193
But in some context in machine learning,

00:09:22.193 --> 00:09:23.921
we want to have more flexibility

00:09:23.921 --> 00:09:25.942
in the types of data that
our models can process.

00:09:25.942 --> 00:09:29.193
So once we move to this idea
of recurrent neural networks,

00:09:29.193 --> 00:09:31.121
we have a lot more opportunities

00:09:31.121 --> 00:09:33.571
to play around with the types
of input and output data

00:09:33.571 --> 00:09:35.313
that our networks can handle.

00:09:35.313 --> 00:09:37.329
So once we have recurrent neural networks,

00:09:37.329 --> 00:09:41.009
we can do what we call
these one to many models.

00:09:41.009 --> 00:09:44.161
Or where maybe our input is
some object of fixed size,

00:09:44.161 --> 00:09:45.032
like an image,

00:09:45.032 --> 00:09:47.500
but now our output is a
sequence of variable length,

00:09:47.500 --> 00:09:48.721
such as a caption.

00:09:48.721 --> 00:09:49.801
Where different captions

00:09:49.801 --> 00:09:51.433
might have different numbers of words,

00:09:51.433 --> 00:09:54.081
so our output needs to
be variable in length.

00:09:54.081 --> 00:09:56.491
We also might have many to one models,

00:09:56.491 --> 00:09:58.622
where our input could be variably sized.

00:09:58.622 --> 00:10:01.001
This might be something
like a piece of text,

00:10:01.001 --> 00:10:03.782
and we want to say what is
the sentiment of that text,

00:10:03.782 --> 00:10:06.161
whether it's positive or
negative in sentiment.

00:10:06.161 --> 00:10:07.771
Or in a computer vision context,

00:10:07.771 --> 00:10:09.401
you might imagine taking as input a video,

00:10:09.401 --> 00:10:12.512
and that video might have a
variable number of frames.

00:10:12.512 --> 00:10:14.921
And now we want to read this entire video

00:10:14.921 --> 00:10:16.401
of potentially variable length.

00:10:16.401 --> 00:10:17.439
And then at the end,

00:10:17.439 --> 00:10:18.742
make a classification decision

00:10:18.742 --> 00:10:20.142
about maybe what kind
of activity or action

00:10:20.142 --> 00:10:22.721
is going on in that video.

00:10:22.721 --> 00:10:26.702
We also have a, we
might also have problems

00:10:26.702 --> 00:10:28.091
where we want both the inputs

00:10:28.091 --> 00:10:29.931
and the output to be variable in length.

00:10:29.931 --> 00:10:31.491
We might see something like this

00:10:31.491 --> 00:10:33.433
in machine translation,

00:10:33.433 --> 00:10:35.870
where our input is some,
maybe, sentence in English,

00:10:35.870 --> 00:10:37.302
which could have a variable length,

00:10:37.302 --> 00:10:39.393
and our output is maybe
some sentence in French,

00:10:39.393 --> 00:10:41.633
which also could have a variable length.

00:10:41.633 --> 00:10:44.032
And crucially, the length
of the English sentence

00:10:44.032 --> 00:10:46.801
might be different from the
length of the French sentence.

00:10:46.801 --> 00:10:48.710
So we need some models
that have the capacity

00:10:48.710 --> 00:10:50.782
to accept both variable length sequences

00:10:50.782 --> 00:10:53.931
on the input and on the output.

00:10:53.931 --> 00:10:56.331
Finally, we might also
consider problems where

00:10:56.331 --> 00:10:58.153
our input is variably length,

00:10:58.153 --> 00:10:59.891
like something like a video sequence

00:10:59.891 --> 00:11:01.553
with a variable number of frames.

00:11:01.553 --> 00:11:02.920
And now we want to make a decision

00:11:02.920 --> 00:11:04.771
for each element of that input sequence.

00:11:04.771 --> 00:11:06.721
So in the context of videos,

00:11:06.721 --> 00:11:09.201
that might be making some
classification decision

00:11:09.201 --> 00:11:11.891
along every frame of the video.

00:11:11.891 --> 00:11:13.281
And recurrent neural networks

00:11:13.281 --> 00:11:15.171
are this kind of general paradigm

00:11:15.171 --> 00:11:17.401
for handling variable sized sequence data

00:11:17.401 --> 00:11:19.302
that allow us to pretty naturally capture

00:11:19.302 --> 00:11:23.469
all of these different types
of setups in our models.

00:11:24.349 --> 00:11:26.662
So recurring neural networks
are actually important,

00:11:26.662 --> 00:11:30.030
even for some problems that
have a fixed size input

00:11:30.030 --> 00:11:31.414
and a fixed size output.

00:11:31.414 --> 00:11:33.752
Recurrent neural networks
can still be pretty useful.

00:11:33.752 --> 00:11:34.763
So in this example,

00:11:34.763 --> 00:11:35.982
we might want to do, for example,

00:11:35.982 --> 00:11:38.793
sequential processing of our input.

00:11:38.793 --> 00:11:40.721
So here, we're receiving
a fixed size input

00:11:40.721 --> 00:11:41.774
like an image,

00:11:41.774 --> 00:11:43.732
and we want to make a
classification decision

00:11:43.732 --> 00:11:46.227
about, like, what number is
being shown in this image?

00:11:46.227 --> 00:11:48.854
But now, rather than just doing
a single feed forward pass

00:11:48.854 --> 00:11:50.393
and making the decision all at once,

00:11:50.393 --> 00:11:52.702
this network is actually
looking around the image

00:11:52.702 --> 00:11:55.553
and taking various glimpses of
different parts of the image.

00:11:55.553 --> 00:11:58.233
And then after making
some series of glimpses,

00:11:58.233 --> 00:11:59.882
then it makes its final decision

00:11:59.882 --> 00:12:01.742
as to what kind of number is present.

00:12:01.742 --> 00:12:03.233
So here, we had one,

00:12:03.233 --> 00:12:05.462
So here, even though
our input and outputs,

00:12:05.462 --> 00:12:06.622
our input was an image,

00:12:06.622 --> 00:12:09.054
and our output was a
classification decision,

00:12:09.054 --> 00:12:10.243
even this context,

00:12:10.243 --> 00:12:11.761
this idea of being able to handle

00:12:11.761 --> 00:12:14.121
variably length processing
with recurrent neural networks

00:12:14.121 --> 00:12:17.473
can lead to some really
interesting types of models.

00:12:17.473 --> 00:12:20.273
There's a really cool paper that I like

00:12:20.273 --> 00:12:22.140
that applied this same type of idea

00:12:22.140 --> 00:12:23.923
to generating new images.

00:12:23.923 --> 00:12:27.083
Where now, we want the model
to synthesize brand new images

00:12:27.083 --> 00:12:29.723
that look kind of like the
images it saw in training,

00:12:29.723 --> 00:12:32.489
and we can use a recurrent
neural network architecture

00:12:32.489 --> 00:12:34.222
to actually paint these output images

00:12:34.222 --> 00:12:36.254
sort of one piece at a time in the output.

00:12:36.254 --> 00:12:37.342
You can see that,

00:12:37.342 --> 00:12:40.260
even though our output
is this fixed size image,

00:12:40.260 --> 00:12:42.942
we can have these models
that are working over time

00:12:42.942 --> 00:12:46.380
to compute parts of the output
one at a time sequentially.

00:12:46.380 --> 00:12:48.577
And we can use recurrent neural networds

00:12:48.577 --> 00:12:51.662
for that type of setup as well.

00:12:51.662 --> 00:12:54.054
So after this sort of cool pitch

00:12:54.054 --> 00:12:55.854
about all these cool
things that RNNs can do,

00:12:55.854 --> 00:12:58.785
you might wonder, like what
exactly are these things?

00:12:58.785 --> 00:13:00.353
So in general, a recurrent neural network

00:13:00.353 --> 00:13:04.163
is this little, has this
little recurrent core cell

00:13:04.163 --> 00:13:06.272
and it will take some input x,

00:13:06.272 --> 00:13:08.734
feed that input into the RNN,

00:13:08.734 --> 00:13:11.382
and that RNN has some
internal hidden state,

00:13:11.382 --> 00:13:14.198
and that internal hidden
state will be updated

00:13:14.198 --> 00:13:17.641
every time that the RNN reads a new input.

00:13:17.641 --> 00:13:19.654
And that internal hidden state

00:13:19.654 --> 00:13:21.243
will be then fed back to the model

00:13:21.243 --> 00:13:23.980
the next time it reads an input.

00:13:23.980 --> 00:13:26.334
And frequently, we will want our RNN"s

00:13:26.334 --> 00:13:28.822
to also produce some
output at every time step,

00:13:28.822 --> 00:13:31.043
so we'll have this pattern
where it will read an input,

00:13:31.043 --> 00:13:32.219
update its hidden state,

00:13:32.219 --> 00:13:34.469
and then produce an output.

00:13:35.814 --> 00:13:37.213
So then the question is

00:13:37.213 --> 00:13:39.862
what is the functional form
of this recurrence relation

00:13:39.862 --> 00:13:40.961
that we're computing?

00:13:40.961 --> 00:13:42.923
So inside this little green RNN block,

00:13:42.923 --> 00:13:45.062
we're computing some recurrence relation,

00:13:45.062 --> 00:13:46.443
with a function f.

00:13:46.443 --> 00:13:49.094
So this function f will
depend on some weights, w.

00:13:49.094 --> 00:13:52.822
It will accept the previous
hidden state, h t - 1,

00:13:52.822 --> 00:13:55.374
as well as the input at
the current state, x t,

00:13:55.374 --> 00:13:56.683
and this will output

00:13:56.683 --> 00:13:59.782
the next hidden state, or
the updated hidden state,

00:13:59.782 --> 00:14:01.420
that we call h t.

00:14:01.420 --> 00:14:02.253
And now,

00:14:02.253 --> 00:14:04.382
then as we read the next input,

00:14:04.382 --> 00:14:05.283
this hidden state,

00:14:05.283 --> 00:14:06.894
this new hidden state, h t,

00:14:06.894 --> 00:14:08.774
will then just be passed
into the same function

00:14:08.774 --> 00:14:11.552
as we read the next input, x t plus one.

00:14:11.552 --> 00:14:13.894
And now, if we wanted
to produce some output

00:14:13.894 --> 00:14:15.273
at every time step of this network,

00:14:15.273 --> 00:14:20.203
we might attach some additional
fully connected layers

00:14:20.203 --> 00:14:21.797
that read in this h t at every time step.

00:14:21.797 --> 00:14:23.407
And make that decision based

00:14:23.407 --> 00:14:27.327
on the hidden state at every time step.

00:14:27.327 --> 00:14:29.018
And one thing to note is that

00:14:29.018 --> 00:14:31.087
we use the same function, f w,

00:14:31.087 --> 00:14:32.495
and the same weights, w,

00:14:32.495 --> 00:14:35.662
at every time step of the computation.

00:14:36.921 --> 00:14:39.706
So then kind of the simplest function form

00:14:39.706 --> 00:14:40.634
that you can imagine

00:14:40.634 --> 00:14:43.434
is what we call this vanilla
recurrent neural network.

00:14:43.434 --> 00:14:45.826
So here, we have this same functional form

00:14:45.826 --> 00:14:46.866
from the previous slide,

00:14:46.866 --> 00:14:49.191
where we're taking in
our previous hidden state

00:14:49.191 --> 00:14:50.145
and our current input

00:14:50.145 --> 00:14:52.483
and we need to produce
the next hidden state.

00:14:52.483 --> 00:14:54.954
And the kind of simplest
thing you might imagine

00:14:54.954 --> 00:14:57.724
is that we have some weight matrix, w x h,

00:14:57.724 --> 00:15:00.124
that we multiply against the input, x t,

00:15:00.124 --> 00:15:03.162
as well as another weight matrix, w h h,

00:15:03.162 --> 00:15:05.615
that we multiply against
the previous hidden state.

00:15:05.615 --> 00:15:07.226
So we make these two multiplications

00:15:07.226 --> 00:15:08.494
against our two states,

00:15:08.494 --> 00:15:09.327
add them together,

00:15:09.327 --> 00:15:10.924
and squash them through a tanh,

00:15:10.924 --> 00:15:13.514
so we get some kind of non
linearity in the system.

00:15:13.514 --> 00:15:15.287
You might be wondering
why we use a tanh here

00:15:15.287 --> 00:15:17.312
and not some other type of non-linearity?

00:15:17.312 --> 00:15:18.824
After all that we've said negative

00:15:18.824 --> 00:15:20.594
about tanh's in previous lectures,

00:15:20.594 --> 00:15:22.424
and I think we'll return
a little bit to that

00:15:22.424 --> 00:15:23.257
later on when we talk about

00:15:23.257 --> 00:15:26.507
more advanced architectures, like lstm.

00:15:27.346 --> 00:15:28.695
So then, this,

00:15:28.695 --> 00:15:30.872
So then, in addition in this architecture,

00:15:30.872 --> 00:15:33.394
if we wanted to produce
some y t at every time step,

00:15:33.394 --> 00:15:35.284
you might have another weight matrix, w,

00:15:35.284 --> 00:15:38.055
you might have another weight matrix

00:15:38.055 --> 00:15:39.055
that accepts this hidden state

00:15:39.055 --> 00:15:40.375
and then transforms it to some y

00:15:40.375 --> 00:15:42.724
to produce maybe some
class score predictions

00:15:42.724 --> 00:15:44.826
at every time step.

00:15:44.826 --> 00:15:46.723
And when I think about
recurrent neural networks,

00:15:46.723 --> 00:15:48.775
I kind of think about, you can also,

00:15:48.775 --> 00:15:50.634
you can kind of think of
recurrent neural networks

00:15:50.634 --> 00:15:51.487
in two ways.

00:15:51.487 --> 00:15:53.713
One is this concept of
having a hidden state

00:15:53.713 --> 00:15:57.095
that feeds back at itself, recurrently.

00:15:57.095 --> 00:15:59.135
But I find that picture
a little bit confusing.

00:15:59.135 --> 00:16:01.073
And sometimes, I find it clearer

00:16:01.073 --> 00:16:04.546
to think about unrolling
this computational graph

00:16:04.546 --> 00:16:05.914
for multiple time steps.

00:16:05.914 --> 00:16:08.382
And this makes the data
flow of the hidden states

00:16:08.382 --> 00:16:09.914
and the inputs and the
outputs and the weights

00:16:09.914 --> 00:16:11.786
maybe a little bit more clear.

00:16:11.786 --> 00:16:13.095
So then at the first time step,

00:16:13.095 --> 00:16:15.494
we'll have some initial
hidden state h zero.

00:16:15.494 --> 00:16:18.914
This is usually initialized
to zeros for most context,

00:16:18.914 --> 00:16:22.415
in most contexts, an then
we'll have some input, x t.

00:16:22.415 --> 00:16:25.084
This initial hidden state, h zero,

00:16:25.084 --> 00:16:26.906
and our current input, x t,

00:16:26.906 --> 00:16:28.324
will go into our f w function.

00:16:28.324 --> 00:16:32.604
This will produce our
next hidden state, h one.

00:16:32.604 --> 00:16:34.034
And then, we'll repeat this process

00:16:34.034 --> 00:16:36.154
when we receive the next input.

00:16:36.154 --> 00:16:38.295
So now our current h one and our x one,

00:16:38.295 --> 00:16:40.036
will go into that same f w,

00:16:40.036 --> 00:16:42.847
to produce our next output, h two.

00:16:42.847 --> 00:16:45.866
And this process will
repeat over and over again,

00:16:45.866 --> 00:16:48.365
as we consume all of the input, x ts,

00:16:48.365 --> 00:16:50.866
in our sequence of inputs.

00:16:50.866 --> 00:16:52.116
And now, one thing to note, is that

00:16:52.116 --> 00:16:54.756
we can actually make
this even more explicit

00:16:54.756 --> 00:16:58.036
and write the w matrix in
our computational graph.

00:16:58.036 --> 00:16:59.058
And here you can see that

00:16:59.058 --> 00:17:01.434
we're re-using the same w matrix

00:17:01.434 --> 00:17:03.415
at every time step of the computation.

00:17:03.415 --> 00:17:06.145
So now every time that we
have this little f w block,

00:17:06.145 --> 00:17:08.724
it's receiving a unique h and a unique x,

00:17:08.724 --> 00:17:11.006
but all of these blocks
are taking the same w.

00:17:11.007 --> 00:17:12.236
And if you remember,

00:17:12.236 --> 00:17:16.116
we talked about how gradient
flows in back propagation,

00:17:16.116 --> 00:17:17.058
when you re-use the same,

00:17:17.058 --> 00:17:19.218
when you re-use the
same node multiple times

00:17:19.218 --> 00:17:20.786
in a computational graph,

00:17:20.786 --> 00:17:22.257
then remember during the backward pass,

00:17:22.257 --> 00:17:24.047
you end up summing the gradients

00:17:24.047 --> 00:17:25.257
into the w matrix

00:17:25.257 --> 00:17:28.218
when you're computing a d los d w.

00:17:28.218 --> 00:17:30.557
So, if you kind of think about

00:17:30.557 --> 00:17:32.526
the back propagation for this model,

00:17:32.526 --> 00:17:34.555
then you'll have a separate gradient for w

00:17:34.555 --> 00:17:36.506
flowing from each of those time steps,

00:17:36.506 --> 00:17:38.276
and then the final gradient for w

00:17:38.276 --> 00:17:39.586
will be the sum of all of those

00:17:39.586 --> 00:17:42.503
individual per time step gradiants.

00:17:43.615 --> 00:17:46.073
We can also write to this y t explicitly

00:17:46.073 --> 00:17:47.727
in this computational graph.

00:17:47.727 --> 00:17:50.668
So then, this output,
h t, at every time step

00:17:50.668 --> 00:17:52.967
might feed into some other
little neural network

00:17:52.967 --> 00:17:54.858
that can produce a y t,

00:17:54.858 --> 00:17:57.256
which might be some class
scores, or something like that,

00:17:57.256 --> 00:17:59.087
at every time step.

00:17:59.087 --> 00:18:00.738
We can also make the loss more explicit.

00:18:00.738 --> 00:18:03.167
So in many cases, you
might imagine producing,

00:18:03.167 --> 00:18:05.778
you might imagine that you
have some ground truth label

00:18:05.778 --> 00:18:07.588
at every time step of your sequence,

00:18:07.588 --> 00:18:08.978
and then you'll compute some loss,

00:18:08.978 --> 00:18:10.487
some individual loss,

00:18:10.487 --> 00:18:11.596
at every time step of

00:18:11.596 --> 00:18:14.068
these outputs, y t's.

00:18:14.068 --> 00:18:14.915
And this loss might,

00:18:14.915 --> 00:18:17.556
it will frequently be
something like soft max loss,

00:18:17.556 --> 00:18:19.497
in the case where you have, maybe,

00:18:19.497 --> 00:18:22.497
a ground truth label at every
time step of the sequence.

00:18:22.497 --> 00:18:24.076
And now the final loss for the entire,

00:18:24.076 --> 00:18:25.395
for this entire training stop,

00:18:25.395 --> 00:18:27.887
will be the sum of
these individual losses.

00:18:27.887 --> 00:18:30.485
So now, we had a scaler
loss at every time step?

00:18:30.485 --> 00:18:32.818
And we just summed them
up to get our final

00:18:32.818 --> 00:18:34.196
scaler loss at the top of the network.

00:18:34.196 --> 00:18:36.207
And now, if you think about,

00:18:36.207 --> 00:18:37.898
again, back propagation
through this thing,

00:18:37.898 --> 00:18:38.896
we need, in order to train the model,

00:18:38.896 --> 00:18:40.215
we need to compute the gradient

00:18:40.215 --> 00:18:42.098
of the loss with respect to w.

00:18:42.098 --> 00:18:44.498
So, we'll have loss flowing
from that final loss

00:18:44.498 --> 00:18:46.178
into each of these time steps.

00:18:46.178 --> 00:18:47.708
And then each of those time steps

00:18:47.708 --> 00:18:49.840
will compute a local
gradient on the weights, w,

00:18:49.840 --> 00:18:52.010
which will all then be
summed to give us our final

00:18:52.010 --> 00:18:54.343
gradient for the weights, w.

00:18:55.597 --> 00:18:58.268
Now if we have a, sort of,
this many to one situation,

00:18:58.268 --> 00:19:01.188
where maybe we want to do
something like sentiment analysis,

00:19:01.188 --> 00:19:03.348
then we would typically make that decision

00:19:03.348 --> 00:19:05.799
based on the final hidden
state of this network.

00:19:05.799 --> 00:19:07.450
Because this final hidden state

00:19:07.450 --> 00:19:08.988
kind of summarizes all of the context

00:19:08.988 --> 00:19:11.868
from the entire sequence.

00:19:11.868 --> 00:19:14.788
Also, if we have a kind of
a one to many situation,

00:19:14.788 --> 00:19:17.319
where we want to receive a fix sized input

00:19:17.319 --> 00:19:19.319
and then produce a variably sized output.

00:19:19.319 --> 00:19:22.698
Then you'll commonly use
that fixed size input

00:19:22.698 --> 00:19:23.988
to initialize, somehow,

00:19:23.988 --> 00:19:26.050
the initial hidden state of the model,

00:19:26.050 --> 00:19:27.986
and now the recurrent network will tick

00:19:27.986 --> 00:19:30.079
for each cell in the output.

00:19:30.079 --> 00:19:32.748
And now, as you produce
your variably sized output,

00:19:32.748 --> 00:19:36.915
you'll unroll the graph for
each element in the output.

00:19:38.490 --> 00:19:41.370
So this, when we talk about
the sequence to sequence models

00:19:41.370 --> 00:19:44.308
where you might do something
like machine translation,

00:19:44.308 --> 00:19:46.099
where you take a variably sized input

00:19:46.099 --> 00:19:47.648
and a variably sized output.

00:19:47.648 --> 00:19:49.300
You can think of this as a combination

00:19:49.300 --> 00:19:50.719
of the many to one,

00:19:50.719 --> 00:19:52.398
plus a one to many.

00:19:52.398 --> 00:19:54.889
So, we'll kind of proceed in two stages,

00:19:54.889 --> 00:19:56.900
what we call an encoder and a decoder.

00:19:56.900 --> 00:19:58.119
So if you're the encoder,

00:19:58.119 --> 00:20:00.119
we'll receive the variably sized input,

00:20:00.119 --> 00:20:02.159
which might be your sentence in English,

00:20:02.159 --> 00:20:04.311
and then summarize that entire sentence

00:20:04.311 --> 00:20:08.110
using the final hidden state
of the encoder network.

00:20:08.110 --> 00:20:11.894
And now we're in this
many to one situation

00:20:11.894 --> 00:20:14.396
where we've summarized this
entire variably sized input

00:20:14.396 --> 00:20:15.769
in this single vector,

00:20:15.769 --> 00:20:17.449
and now, we have a second decoder network,

00:20:17.449 --> 00:20:19.409
which is a one to many situation,

00:20:19.409 --> 00:20:21.740
which will input that single vector

00:20:21.740 --> 00:20:23.111
summarizing the input sentence

00:20:23.111 --> 00:20:25.487
and now produce this
variably sized output,

00:20:25.487 --> 00:20:28.969
which might be your sentence
in another language.

00:20:28.969 --> 00:20:30.980
And now in this variably sized output,

00:20:30.980 --> 00:20:32.820
we might make some predictions
at every time step,

00:20:32.820 --> 00:20:34.609
maybe about what word to use.

00:20:34.609 --> 00:20:36.631
And you can imagine kind of
training this entire thing

00:20:36.631 --> 00:20:38.199
by unrolling this computational graph

00:20:38.199 --> 00:20:40.608
summing the losses at the output sequence

00:20:40.608 --> 00:20:44.692
and just performing back
propagation, as usual.

00:20:44.692 --> 00:20:46.391
So as a bit of a concrete example,

00:20:46.391 --> 00:20:47.729
one thing that we frequently use

00:20:47.729 --> 00:20:48.961
recurrent neural networks for,

00:20:48.961 --> 00:20:50.940
is this problem called language modeling.

00:20:50.940 --> 00:20:52.990
So in the language modeling problem,

00:20:52.990 --> 00:20:55.060
we want to read some sequence of,

00:20:55.060 --> 00:20:58.151
we want to have our
network, sort of, understand

00:20:58.151 --> 00:21:00.908
how to produce natural language.

00:21:00.908 --> 00:21:04.071
So in the, so this, this might
happen at the character level

00:21:04.071 --> 00:21:06.601
where our model will produce
characters one at a time.

00:21:06.601 --> 00:21:08.389
This might also happen at the word level

00:21:08.389 --> 00:21:10.769
where our model will
produce words one at a time.

00:21:10.769 --> 00:21:12.129
But in a very simple example,

00:21:12.129 --> 00:21:14.740
you can imagine this
character level language model

00:21:14.740 --> 00:21:15.980
where we want,

00:21:15.980 --> 00:21:18.081
where the network will read
some sequence of characters

00:21:18.081 --> 00:21:19.388
and then it needs to predict,

00:21:19.388 --> 00:21:22.780
what will the next character
be in this stream of text?

00:21:22.780 --> 00:21:24.700
So in this example,

00:21:24.700 --> 00:21:27.460
we have this very small
vocabulary of four letters,

00:21:27.460 --> 00:21:31.213
h, e, l, and o, and we have
this example training sequence

00:21:31.213 --> 00:21:33.884
of the word hello, h, e, l, l, o.

00:21:33.884 --> 00:21:34.911
So during training,

00:21:34.911 --> 00:21:37.586
when we're training this language model,

00:21:37.586 --> 00:21:42.100
we will feed the characters
of this training sequence

00:21:42.100 --> 00:21:46.119
as inputs, as x ts, to out input of our,

00:21:46.119 --> 00:21:49.689
we'll feed the characters
of our training sequence,

00:21:49.689 --> 00:21:52.191
these will be the x ts that
we feed in as the inputs

00:21:52.191 --> 00:21:53.980
to our recurrent neural network.

00:21:53.980 --> 00:21:56.168
And then, each of these inputs,

00:21:56.168 --> 00:21:57.678
it's a letter,

00:21:57.678 --> 00:21:58.890
and we need to figure out a way

00:21:58.890 --> 00:22:01.039
to represent letters in our network.

00:22:01.039 --> 00:22:02.759
So what we'll typically do is figure out

00:22:02.759 --> 00:22:05.100
what is our total vocabulary.

00:22:05.100 --> 00:22:07.460
In this case, our vocabulary
has four elements.

00:22:07.460 --> 00:22:09.967
And each letter will be
represented by a vector

00:22:09.967 --> 00:22:12.589
that has zeros in every slot but one,

00:22:12.589 --> 00:22:15.031
and a one for the slot in the vocabulary

00:22:15.031 --> 00:22:16.628
corresponding to that letter.

00:22:16.628 --> 00:22:18.092
In this little example,

00:22:18.092 --> 00:22:20.751
since our vocab has the
four letters, h, e, l, o,

00:22:20.751 --> 00:22:22.324
then our input sequence,

00:22:22.324 --> 00:22:25.374
the h is represented by
a four element vector

00:22:25.374 --> 00:22:26.535
with a one in the first slot

00:22:26.535 --> 00:22:28.684
and zero's in the other three slots.

00:22:28.684 --> 00:22:29.876
And we use the same sort of pattern

00:22:29.876 --> 00:22:31.306
to represent all the different letters

00:22:31.306 --> 00:22:33.139
in the input sequence.

00:22:34.914 --> 00:22:37.021
Now, during this forward pass

00:22:37.021 --> 00:22:38.575
of what this network is doing,

00:22:38.575 --> 00:22:39.965
at the first time step,

00:22:39.965 --> 00:22:41.874
it will receive the input letter h.

00:22:41.874 --> 00:22:45.285
That will go into the first RNN,

00:22:45.285 --> 00:22:46.714
to the RNN cell,

00:22:46.714 --> 00:22:48.594
and then we'll produce this output, y t,

00:22:48.594 --> 00:22:50.285
which is the network making predictions

00:22:50.285 --> 00:22:52.525
about for each letter in the vocabulary,

00:22:52.525 --> 00:22:54.411
which letter does it think is most likely

00:22:54.411 --> 00:22:56.024
going to come next.

00:22:56.024 --> 00:22:57.245
In this example,

00:22:57.245 --> 00:22:59.330
the correct output letter was e

00:22:59.330 --> 00:23:01.405
because our training sequence was hello,

00:23:01.405 --> 00:23:04.488
but the model is actually predicting,

00:23:05.477 --> 00:23:06.310
I think it's actually

00:23:06.310 --> 00:23:07.850
predicting o as the most likely letter.

00:23:07.850 --> 00:23:09.690
So in this case, this prediction was wrong

00:23:09.690 --> 00:23:11.210
and we would use softmaxt loss

00:23:11.210 --> 00:23:13.889
to quantify our unhappiness
with these predictions.

00:23:13.889 --> 00:23:15.192
The next time step,

00:23:15.192 --> 00:23:16.680
we would feed in the second letter

00:23:16.680 --> 00:23:18.031
in the training sequence, e,

00:23:18.031 --> 00:23:19.741
and this process will repeat.

00:23:19.741 --> 00:23:22.181
We'll now represent e as a vector.

00:23:22.181 --> 00:23:24.232
Use that input vector together

00:23:24.232 --> 00:23:25.649
with the previous hidden state

00:23:25.649 --> 00:23:27.271
to produce a new hidden state

00:23:27.271 --> 00:23:28.792
and now use the second hidden state

00:23:28.792 --> 00:23:30.032
to, again, make predictions

00:23:30.032 --> 00:23:31.912
over every letter in the vocabulary.

00:23:31.912 --> 00:23:34.232
In this case, because our
training sequence was hello,

00:23:34.232 --> 00:23:35.712
after the letter e,

00:23:35.712 --> 00:23:36.810
we want our model to predict l.

00:23:36.810 --> 00:23:37.893
In this case,

00:23:40.183 --> 00:23:41.832
our model may have very low predictions

00:23:41.832 --> 00:23:44.244
for the letter l, so we
would incur high loss.

00:23:44.244 --> 00:23:46.794
And you kind of repeat
this process over and over,

00:23:46.794 --> 00:23:50.343
and if you train this model
with many different sequences,

00:23:50.343 --> 00:23:52.023
then eventually it should learn

00:23:52.023 --> 00:23:54.303
how to predict the next
character in a sequence

00:23:54.303 --> 00:23:56.763
based on the context of
all the previous characters

00:23:56.763 --> 00:23:58.596
that it's seen before.

00:23:59.893 --> 00:24:01.655
And now, if you think about
what happens at test time,

00:24:01.655 --> 00:24:03.388
after we train this model,

00:24:03.388 --> 00:24:05.033
one thing that we might want to do with it

00:24:05.033 --> 00:24:07.594
is a sample from the model,

00:24:07.594 --> 00:24:09.673
and actually use this
trained neural network model

00:24:09.673 --> 00:24:11.764
to synthesize new text

00:24:11.764 --> 00:24:13.592
that kind of looks similar in spirit

00:24:13.592 --> 00:24:15.103
to the text that it was trained on.

00:24:15.103 --> 00:24:16.312
The way that this will work

00:24:16.312 --> 00:24:17.833
is we'll typically see the model

00:24:17.833 --> 00:24:19.634
with some input prefix of text.

00:24:19.634 --> 00:24:22.716
In this case, the prefix is
just the single letter h,

00:24:22.716 --> 00:24:24.327
and now we'll feed that letter h

00:24:24.327 --> 00:24:27.295
through the first time step of
our recurrent neural network.

00:24:27.295 --> 00:24:30.335
It will product this
distribution of scores

00:24:30.335 --> 00:24:32.916
over all the characters in the vocabulary.

00:24:32.916 --> 00:24:35.592
Now, at training time,
we'll use these scores

00:24:35.592 --> 00:24:37.501
to actually sample from it.

00:24:37.501 --> 00:24:38.893
So we'll use a softmaxt function

00:24:38.893 --> 00:24:41.421
to convert those scores into
a probability distribution

00:24:41.421 --> 00:24:44.162
and then we will sample from
that probability distribution

00:24:44.162 --> 00:24:47.362
to actually synthesize the
second letter in the sequence.

00:24:47.362 --> 00:24:50.491
And in this case, even though
the scores were pretty bad,

00:24:50.491 --> 00:24:53.061
maybe we got lucky and
sampled the letter e

00:24:53.061 --> 00:24:54.771
from this probability distribution.

00:24:54.771 --> 00:24:56.962
And now, we'll take this letter e

00:24:56.962 --> 00:24:59.253
that was sampled from this distribution

00:24:59.253 --> 00:25:01.522
and feed it back as input into the network

00:25:01.522 --> 00:25:02.492
at the next time step.

00:25:02.492 --> 00:25:05.118
Now, we'll take this e,
pull it down from the top,

00:25:05.118 --> 00:25:07.170
feed it back into the network

00:25:07.170 --> 00:25:10.071
as one of these, sort of, one
hot vectorial representations,

00:25:10.071 --> 00:25:11.530
and then repeat the process

00:25:11.530 --> 00:25:15.008
in order to synthesize the
second letter in the output.

00:25:15.008 --> 00:25:17.290
And we can repeat this
process over and over again

00:25:17.290 --> 00:25:20.722
to synthesize a new sequence
using this trained model,

00:25:20.722 --> 00:25:22.442
where we're synthesizing the sequence

00:25:22.442 --> 00:25:23.770
one character at a time

00:25:23.770 --> 00:25:25.922
using these predicted
probability distributions

00:25:25.922 --> 00:25:27.712
at each time step.

00:25:27.712 --> 00:25:28.545
Question?

00:25:34.792 --> 00:25:36.181
Yeah, that's a great question.

00:25:36.181 --> 00:25:37.960
So the question is why might we sample

00:25:37.960 --> 00:25:39.381
instead of just taking the character

00:25:39.381 --> 00:25:41.315
with the largest score?

00:25:41.315 --> 00:25:42.398
In this case,

00:25:43.294 --> 00:25:45.624
because of the probability
distribution that we had,

00:25:45.624 --> 00:25:47.451
it was impossible to
get the right character,

00:25:47.451 --> 00:25:49.642
so we had the sample so
the example could work out,

00:25:49.642 --> 00:25:51.512
and it would make sense.

00:25:51.512 --> 00:25:54.384
But in practice,
sometimes you'll see both.

00:25:54.384 --> 00:25:56.432
So sometimes you'll just
take the argmax probability,

00:25:56.432 --> 00:25:59.482
and that will sometimes be
a little bit more stable,

00:25:59.482 --> 00:26:01.791
but one advantage of sampling, in general,

00:26:01.791 --> 00:26:04.264
is that it lets you get
diversity from your models.

00:26:04.264 --> 00:26:07.242
Sometimes you might have the same input,

00:26:07.242 --> 00:26:08.490
maybe the same prefix,

00:26:08.490 --> 00:26:09.722
or in the case of image captioning,

00:26:09.722 --> 00:26:11.421
maybe the same image.

00:26:11.421 --> 00:26:13.583
But then if you sample rather
than taking the argmax,

00:26:13.583 --> 00:26:15.562
then you'll see that
sometimes these trained models

00:26:15.562 --> 00:26:18.352
are actually able to produce
multiple different types

00:26:18.352 --> 00:26:20.032
of reasonable output sequences,

00:26:20.032 --> 00:26:21.053
depending on the kind,

00:26:21.053 --> 00:26:22.573
depending on which samples they take

00:26:22.573 --> 00:26:23.824
at the first time steps.

00:26:23.824 --> 00:26:25.973
It's actually kind of a benefit

00:26:25.973 --> 00:26:29.213
cause we can get now more
diversity in our outputs.

00:26:29.213 --> 00:26:30.630
Another question?

00:26:35.143 --> 00:26:37.153
Could we feed in the softmax vector

00:26:37.153 --> 00:26:38.602
instead of the one element vector?

00:26:38.602 --> 00:26:40.435
You mean at test time?

00:26:46.162 --> 00:26:47.992
Yeah yeah, so the
question is, at test time,

00:26:47.992 --> 00:26:49.712
could we feed in this whole softmax vector

00:26:49.712 --> 00:26:51.373
rather than a one hot vector?

00:26:51.373 --> 00:26:52.752
There's kind of two problems with that.

00:26:52.752 --> 00:26:54.762
One is that that's very different

00:26:54.762 --> 00:26:56.782
from the data that it
saw at training time.

00:26:56.782 --> 00:26:58.792
In general, if you ask your model

00:26:58.792 --> 00:26:59.943
to do something at test time,

00:26:59.943 --> 00:27:01.223
which is different from training time,

00:27:01.223 --> 00:27:02.752
then it'll usually blow up.

00:27:02.752 --> 00:27:03.784
It'll usually give you garbage

00:27:03.784 --> 00:27:05.413
and you'll usually be sad.

00:27:05.413 --> 00:27:07.083
The other problem is that in practice,

00:27:07.083 --> 00:27:09.112
our vocabularies might be very large.

00:27:09.112 --> 00:27:10.480
So maybe, in this simple example,

00:27:10.480 --> 00:27:12.182
our vocabulary is only four elements,

00:27:12.182 --> 00:27:13.202
so it's not a big problem.

00:27:13.202 --> 00:27:15.611
But if you're thinking about
generating words one at a time,

00:27:15.611 --> 00:27:18.773
now your vocabulary is every
word in the English language,

00:27:18.773 --> 00:27:21.162
which could be something like
tens of thousands of elements.

00:27:21.162 --> 00:27:23.642
So in practice, this first element,

00:27:23.642 --> 00:27:26.373
this first operation that's
taking in this one hot vector,

00:27:26.373 --> 00:27:29.173
is often performed using
sparse vector operations

00:27:29.173 --> 00:27:30.533
rather than dense factors.

00:27:30.533 --> 00:27:32.693
It would be, sort of,
computationally really bad

00:27:32.693 --> 00:27:35.018
if you wanted to have this load of

00:27:35.018 --> 00:27:36.827
10,000 elements softmax vector.

00:27:36.827 --> 00:27:38.802
So that's usually why we
use a one hot instead,

00:27:38.802 --> 00:27:40.302
even at test time.

00:27:42.121 --> 00:27:44.482
This idea that we have a sequence

00:27:44.482 --> 00:27:47.104
and we produce an output at
every time step of the sequence

00:27:47.104 --> 00:27:48.693
and then finally compute some loss,

00:27:48.693 --> 00:27:51.543
this is sometimes called
backpropagation through time

00:27:51.543 --> 00:27:53.962
because you're imagining
that in the forward pass,

00:27:53.962 --> 00:27:55.819
you're kind of stepping
forward through time

00:27:55.819 --> 00:27:57.053
and then during the backward pass,

00:27:57.053 --> 00:27:59.042
you're sort of going
backwards through time

00:27:59.042 --> 00:28:00.762
to compute all your gradients.

00:28:00.762 --> 00:28:03.389
This can actually be kind of problematic

00:28:03.389 --> 00:28:06.162
if you want to train the sequences
that are very, very long.

00:28:06.162 --> 00:28:08.333
So if you imagine that we
were kind of trying to train

00:28:08.333 --> 00:28:09.602
a neural network language model

00:28:09.602 --> 00:28:12.107
on maybe the entire text of Wikipedia,

00:28:12.107 --> 00:28:13.171
which is, by the way,

00:28:13.171 --> 00:28:15.453
something that people
do pretty frequently,

00:28:15.453 --> 00:28:16.672
this would be super slow,

00:28:16.672 --> 00:28:19.282
and every time we made a gradient step,

00:28:19.282 --> 00:28:20.670
we would have to make a forward pass

00:28:20.670 --> 00:28:23.328
through the entire text
of all of wikipedia,

00:28:23.328 --> 00:28:25.901
and then make a backward pass
through all of wikipedia,

00:28:25.901 --> 00:28:27.813
and then make a single gradient update.

00:28:27.813 --> 00:28:28.750
And that would be super slow.

00:28:28.750 --> 00:28:29.984
Your model would never converge.

00:28:29.984 --> 00:28:31.853
It would also take a
ridiculous amount of memory

00:28:31.853 --> 00:28:34.172
so this would be just really bad.

00:28:34.172 --> 00:28:36.384
In practice, what people
do is this, sort of,

00:28:36.384 --> 00:28:39.933
approximation called truncated
backpropagation through time.

00:28:39.933 --> 00:28:41.522
Here, the idea is that,

00:28:41.522 --> 00:28:43.973
even though our input
sequence is very, very long,

00:28:43.973 --> 00:28:45.562
and even potentially infinite,

00:28:45.562 --> 00:28:47.184
what we'll do is that during,

00:28:47.184 --> 00:28:48.521
when we're training the model,

00:28:48.521 --> 00:28:50.544
we'll step forward for
some number of steps,

00:28:50.544 --> 00:28:54.402
maybe like a hundred is
kind of a ballpark number

00:28:54.402 --> 00:28:56.232
that people frequently use,

00:28:56.232 --> 00:28:58.623
and we'll step forward
for maybe a hundred steps,

00:28:58.623 --> 00:29:01.944
compute a loss only over this
sub sequence of the data,

00:29:01.944 --> 00:29:04.312
and then back propagate
through this sub sequence,

00:29:04.312 --> 00:29:06.261
and now make a gradient step.

00:29:06.261 --> 00:29:08.102
And now, when we repeat, well,

00:29:08.102 --> 00:29:10.072
we still have these hidden states

00:29:10.072 --> 00:29:12.064
that we computed from the first batch,

00:29:12.064 --> 00:29:14.664
and now, when we compute
this next batch of data,

00:29:14.664 --> 00:29:17.933
we will carry those hidden
states forward in time,

00:29:17.933 --> 00:29:20.631
so the forward pass will
be exactly the same.

00:29:20.631 --> 00:29:23.184
But now when we compute a gradient step

00:29:23.184 --> 00:29:24.392
for this next batch of data,

00:29:24.392 --> 00:29:28.124
we will only backpropagate
again through this second batch.

00:29:28.124 --> 00:29:29.659
Now, we'll make a gradient step

00:29:29.659 --> 00:29:32.760
based on this truncated
backpropagation through time.

00:29:32.760 --> 00:29:34.392
This process will continue,

00:29:34.392 --> 00:29:36.440
where now when we make the next batch,

00:29:36.440 --> 00:29:38.250
we'll again copy these
hidden states forward,

00:29:38.250 --> 00:29:40.930
but then step forward
and then step backward,

00:29:40.930 --> 00:29:43.840
but only for some small
number of time steps.

00:29:43.840 --> 00:29:44.673
So this is,

00:29:44.673 --> 00:29:45.600
you can kind of think of this

00:29:45.600 --> 00:29:48.250
as being an alegist who's
the cast at gradient descent

00:29:48.250 --> 00:29:49.872
in the case of sequences.

00:29:49.872 --> 00:29:52.280
Remember, when we talked
about training our models

00:29:52.280 --> 00:29:53.690
on large data sets,

00:29:53.690 --> 00:29:54.672
then these data sets,

00:29:54.672 --> 00:29:56.880
it would be super expensive
to compute the gradients

00:29:56.880 --> 00:29:58.720
over every element in the data set.

00:29:58.720 --> 00:30:00.861
So instead, we kind of take small samples,

00:30:00.861 --> 00:30:02.520
small mini batches instead,

00:30:02.520 --> 00:30:05.290
and use mini batches of data
to compute gradient stops

00:30:05.290 --> 00:30:08.053
in any kind of image classification case.

00:30:08.053 --> 00:30:08.886
Question?

00:30:12.441 --> 00:30:14.415
Is this kind of, the question is,

00:30:14.415 --> 00:30:15.989
is this kind of making
the Mark Hobb assumption?

00:30:15.989 --> 00:30:17.239
No, not really.

00:30:18.100 --> 00:30:20.001
Because we're carrying
this hidden state forward

00:30:20.001 --> 00:30:21.442
in time forever.

00:30:21.442 --> 00:30:23.512
It's making a Marcovian assumption

00:30:23.512 --> 00:30:25.792
in the sense that, conditioned
on the hidden state,

00:30:25.792 --> 00:30:27.733
but the hidden state is all that we need

00:30:27.733 --> 00:30:28.971
to predict the entire future

00:30:28.971 --> 00:30:31.101
of the sequence.

00:30:31.101 --> 00:30:33.019
But that assumption is kind of built

00:30:33.019 --> 00:30:35.012
into the recurrent neural network formula

00:30:35.012 --> 00:30:35.941
from the start.

00:30:35.941 --> 00:30:37.269
And that's not really particular

00:30:37.269 --> 00:30:39.032
to back propagation through time.

00:30:39.032 --> 00:30:40.372
Back propagation through time,

00:30:40.372 --> 00:30:43.061
or sorry, truncated back prop though time

00:30:43.061 --> 00:30:44.352
is just the way to
approximate these gradients

00:30:44.352 --> 00:30:47.072
without going making a backwards pass

00:30:47.072 --> 00:30:51.239
through your potentially
very large sequence of data.

00:30:52.677 --> 00:30:55.301
This all sounds very
complicated and confusing

00:30:55.301 --> 00:30:56.871
and it sounds like a lot of code to write,

00:30:56.871 --> 00:30:59.649
but in fact, this can
acutally be pretty concise.

00:30:59.649 --> 00:31:03.418
Andrea has this example of
what he calls min-char-rnn,

00:31:03.418 --> 00:31:04.816
that does all of this stuff

00:31:04.816 --> 00:31:07.474
in just like a 112 lines of Python.

00:31:07.474 --> 00:31:09.045
It handles building the vocabulary.

00:31:09.045 --> 00:31:09.925
It trains the model

00:31:09.925 --> 00:31:11.725
with truncated back
propagation through time.

00:31:11.725 --> 00:31:13.936
And then, it can actually
sample from that model

00:31:13.936 --> 00:31:16.584
in actually not too much code.

00:31:16.584 --> 00:31:17.925
So even though this sounds like

00:31:17.925 --> 00:31:18.965
kind of a big, scary process,

00:31:18.965 --> 00:31:20.862
it's actually not too difficult.

00:31:20.862 --> 00:31:22.314
I'd encourage you, if you're confused,

00:31:22.314 --> 00:31:23.456
to maybe go check this out

00:31:23.456 --> 00:31:25.125
and step through the
code on your own time,

00:31:25.125 --> 00:31:27.074
and see, kind of, all
of these concrete steps

00:31:27.074 --> 00:31:27.954
happening in code.

00:31:27.954 --> 00:31:30.184
So this is all in just a single file,

00:31:30.184 --> 00:31:31.723
all using numpy with no dependencies.

00:31:31.723 --> 00:31:34.473
This was relatively easy to read.

00:31:35.584 --> 00:31:36.896
So then, once we have this idea

00:31:36.896 --> 00:31:39.056
of training a recurrent
neural network language model,

00:31:39.056 --> 00:31:41.593
we can actually have a
lot of fun with this.

00:31:41.593 --> 00:31:43.984
And we can take in, sort
of, any text that we want.

00:31:43.984 --> 00:31:46.424
Take in, like, whatever
random text you can think of

00:31:46.424 --> 00:31:47.384
from the internet,

00:31:47.384 --> 00:31:49.656
train our recurrent neural
network language model

00:31:49.656 --> 00:31:50.616
on this text,

00:31:50.616 --> 00:31:52.304
and then generate new text.

00:31:52.304 --> 00:31:55.502
So in this example, we
took this entire text

00:31:55.502 --> 00:31:57.394
of all of Shakespeare's works,

00:31:57.394 --> 00:31:59.314
and then used that to train

00:31:59.314 --> 00:32:00.885
a recurrent neural network language model

00:32:00.885 --> 00:32:02.634
on all of Shakespeare.

00:32:02.634 --> 00:32:03.896
And you can see that the
beginning of training,

00:32:03.896 --> 00:32:05.773
it's kind of producing maybe
random gibberish garbage,

00:32:05.773 --> 00:32:08.216
but throughout the course of training,

00:32:08.216 --> 00:32:11.584
it ends up producing things
that seem relatively reasonable.

00:32:11.584 --> 00:32:12.784
And after you've,

00:32:12.784 --> 00:32:14.272
after this model has
been trained pretty well,

00:32:14.272 --> 00:32:16.104
then it produces text that seems,

00:32:16.104 --> 00:32:18.224
kind of, Shakespeare-esque to me.

00:32:18.224 --> 00:32:20.264
"Why do what that day," replied,

00:32:20.264 --> 00:32:23.184
whatever, right, you can read this.

00:32:23.184 --> 00:32:24.754
Like, it kind of looks
kind of like Shakespeare.

00:32:24.754 --> 00:32:27.184
And if you actually train
this model even more,

00:32:27.184 --> 00:32:28.765
and let it converge even further,

00:32:28.765 --> 00:32:31.264
and then sample these
even longer sequences,

00:32:31.264 --> 00:32:33.904
you can see that it learns
all kinds of crazy cool stuff

00:32:33.904 --> 00:32:35.864
that really looks like a Shakespeare play.

00:32:35.864 --> 00:32:38.856
It knows that it uses,
maybe, these headings

00:32:38.856 --> 00:32:40.016
to say who's speaking.

00:32:40.016 --> 00:32:42.224
Then it produces these bits of text

00:32:42.224 --> 00:32:43.736
that have crazy dialogue

00:32:43.736 --> 00:32:45.565
that sounds kind of Shakespeare-esque.

00:32:45.565 --> 00:32:46.714
It knows to put line breaks

00:32:46.714 --> 00:32:47.744
in between these different things.

00:32:47.744 --> 00:32:49.342
And this is all, like, really cool,

00:32:49.342 --> 00:32:52.958
all just sort of learned from
the structure of the data.

00:32:52.958 --> 00:32:55.536
We can actually get
even crazier than this.

00:32:55.536 --> 00:32:58.368
This was one of my favorite examples.

00:32:58.368 --> 00:33:00.443
I found online, there's this.

00:33:00.443 --> 00:33:02.725
Is anyone a mathematician in this room?

00:33:02.725 --> 00:33:05.914
Has anyone taken an algebraic
topology course by any chance?

00:33:05.914 --> 00:33:07.325
Wow, a couple, that's impressive.

00:33:07.325 --> 00:33:10.765
So you probably know more
algebraic topology than me,

00:33:10.765 --> 00:33:12.165
but I found this open source

00:33:12.165 --> 00:33:15.114
algebraic topology textbook online.

00:33:15.114 --> 00:33:16.514
It's just a whole bunch of tech files

00:33:16.514 --> 00:33:19.554
that are like this
super dense mathematics.

00:33:19.554 --> 00:33:22.593
And LaTac, cause LaTac is sort of this,

00:33:22.593 --> 00:33:24.774
let's you write equations and diagrams

00:33:24.774 --> 00:33:26.325
and everything just using plain text.

00:33:26.325 --> 00:33:27.455
We can actually train our

00:33:27.455 --> 00:33:28.816
recurrent neural network language model

00:33:28.816 --> 00:33:31.457
on the raw Latac source code

00:33:31.457 --> 00:33:33.146
of this algebraic topology textbook.

00:33:33.146 --> 00:33:37.687
And if we do that, then after
we sample from the model,

00:33:37.687 --> 00:33:39.208
then we get something that seems like,

00:33:39.208 --> 00:33:41.446
kind of like algebraic topology.

00:33:41.446 --> 00:33:43.797
So it knows to like put equations.

00:33:43.797 --> 00:33:46.679
It puts all kinds of crazy stuff.

00:33:46.679 --> 00:33:48.717
It's like, to prove study,

00:33:48.717 --> 00:33:51.574
we see that F sub U is
a covering of x prime,

00:33:51.574 --> 00:33:52.773
blah, blah, blah, blah, blah.

00:33:52.773 --> 00:33:53.992
It knows where to put unions.

00:33:53.992 --> 00:33:56.576
It knows to put squares
at the end of proofs.

00:33:56.576 --> 00:33:57.528
It makes lemmas.

00:33:57.528 --> 00:34:00.026
It makes references to previous lemmas.

00:34:00.026 --> 00:34:01.225
Right, like we hear, like.

00:34:01.225 --> 00:34:03.781
It's namely a bi-lemma question.

00:34:03.781 --> 00:34:05.917
We see that R is geometrically something.

00:34:05.917 --> 00:34:08.417
So it's actually pretty crazy.

00:34:09.496 --> 00:34:12.913
It also sometimes tries to make diagrams.

00:34:14.239 --> 00:34:16.132
For those of you that have
taken algebraic topology,

00:34:16.132 --> 00:34:17.322
you know that these commutative diagrams

00:34:17.322 --> 00:34:19.313
are kind of a thing
that you work with a lot

00:34:19.313 --> 00:34:21.153
So it kind of got the general gist

00:34:21.154 --> 00:34:22.353
of how to make those diagrams,

00:34:22.353 --> 00:34:26.379
but they actually don't make any sense.

00:34:26.380 --> 00:34:28.130
And actually,

00:34:28.130 --> 00:34:29.440
one of my favorite examples here

00:34:29.440 --> 00:34:32.728
is that it sometimes omits proofs.

00:34:32.728 --> 00:34:33.788
So it'll sometimes say,

00:34:33.789 --> 00:34:35.418
it'll sometimes say something like

00:34:35.418 --> 00:34:39.695
theorem, blah, blah, blah,
blah, blah, proof omitted.

00:34:39.695 --> 00:34:41.559
This thing kind of has gotten the gist

00:34:41.559 --> 00:34:45.392
of how some of these
math textbooks look like.

00:34:47.831 --> 00:34:49.122
We can have a lot of fun with this.

00:34:49.123 --> 00:34:50.792
So we also tried training
one of these models

00:34:50.792 --> 00:34:53.498
on the entire source
code of the Linux kernel.

00:34:53.498 --> 00:34:55.791
'Cause again, this character level stuff

00:34:55.792 --> 00:34:56.801
that we can train on,

00:34:56.801 --> 00:34:58.630
And then, when we sample this,

00:34:58.630 --> 00:35:01.483
it acutally again looks
like C source code.

00:35:01.483 --> 00:35:03.534
It knows how to write if statements.

00:35:03.534 --> 00:35:06.192
It has, like, pretty good
code formatting skills.

00:35:06.192 --> 00:35:08.110
It knows to indent after
these if statements.

00:35:08.110 --> 00:35:09.552
It knows to put curly braces.

00:35:09.552 --> 00:35:12.289
It actually even makes
comments about some things

00:35:12.289 --> 00:35:15.052
that are usually nonsense.

00:35:15.052 --> 00:35:18.089
One problem with this model is that

00:35:18.089 --> 00:35:19.842
it knows how to declare variables.

00:35:19.842 --> 00:35:23.355
But it doesn't always use the
variables that it declares.

00:35:23.355 --> 00:35:24.414
And sometimes it tries

00:35:24.414 --> 00:35:26.134
to use variables that
haven't been declared.

00:35:26.134 --> 00:35:27.554
This wouldn't compile.

00:35:27.554 --> 00:35:28.814
I would not recommend sending this

00:35:28.814 --> 00:35:30.555
as a pull request to Linux.

00:35:30.555 --> 00:35:34.606
This thing also figures
out how to recite the GNU,

00:35:34.606 --> 00:35:37.867
this GNU license character by character.

00:35:37.867 --> 00:35:40.454
It kind of knows that you
need to recite the GNU license

00:35:40.454 --> 00:35:42.696
and after the license comes some includes,

00:35:42.696 --> 00:35:44.962
then some other includes,
then source code.

00:35:44.962 --> 00:35:46.864
This thing has actually
learned quite a lot

00:35:46.864 --> 00:35:48.836
about the general structure of the data.

00:35:48.836 --> 00:35:50.096
Where, again, during training,

00:35:50.096 --> 00:35:51.518
all we asked this model to do

00:35:51.518 --> 00:35:53.398
was try to predict the next
character in the sequence.

00:35:53.398 --> 00:35:55.406
We didn't tell it any of this structure,

00:35:55.406 --> 00:35:57.534
but somehow, just through the course

00:35:57.534 --> 00:35:58.856
of this training process,

00:35:58.856 --> 00:36:01.438
it learned a lot about
the latent structure

00:36:01.438 --> 00:36:03.355
in the sequential data.

00:36:05.808 --> 00:36:07.406
Yeah, so it knows how to write code.

00:36:07.406 --> 00:36:10.246
It does a lot of cool stuff.

00:36:10.246 --> 00:36:13.575
I had this paper with
Andre a couple years ago

00:36:13.575 --> 00:36:14.966
where we trained a bunch of these models

00:36:14.966 --> 00:36:17.176
and then we wanted to try
to poke into the brains

00:36:17.176 --> 00:36:18.009
of these models

00:36:18.009 --> 00:36:19.376
and figure out like what are they doing

00:36:19.376 --> 00:36:20.856
and why are they working.

00:36:20.856 --> 00:36:22.027
So we saw, in our,

00:36:22.027 --> 00:36:23.386
these recurring neural networks

00:36:23.386 --> 00:36:24.947
has this hidden vector which is,

00:36:24.947 --> 00:36:28.165
maybe, some vector that's
updated over every time step.

00:36:28.165 --> 00:36:29.838
And then what we wanted
to try to figure out is

00:36:29.838 --> 00:36:32.158
could we find some elements of this vector

00:36:32.158 --> 00:36:34.176
that have some Symantec
interpretable meaning.

00:36:34.176 --> 00:36:35.336
So what we did is

00:36:35.336 --> 00:36:37.096
we trained a neural
network language model,

00:36:37.096 --> 00:36:38.507
one of these character level models

00:36:38.507 --> 00:36:39.917
on one of these data sets,

00:36:39.917 --> 00:36:42.965
and then we picked one of the
elements in that hidden vector

00:36:42.965 --> 00:36:45.406
and now we look at what is the
value of that hidden vector

00:36:45.406 --> 00:36:47.923
over the course of a sequence

00:36:47.923 --> 00:36:49.926
to try to get some sense of maybe

00:36:49.926 --> 00:36:52.206
what these different hidden
states are looking for.

00:36:52.206 --> 00:36:54.086
When you do this, a lot
of them end up looking

00:36:54.086 --> 00:36:56.406
kind of like random gibberish garbage.

00:36:56.406 --> 00:36:57.736
So here again, what we've done,

00:36:57.736 --> 00:36:59.806
is we've picked one
element of that vector,

00:36:59.806 --> 00:37:01.467
and now we run the sequence forward

00:37:01.467 --> 00:37:02.686
through the trained model,

00:37:02.686 --> 00:37:04.056
and now the color of each character

00:37:04.056 --> 00:37:07.271
corresponds to the
magnitude of that single

00:37:07.271 --> 00:37:09.446
scaler element of the hidden
vector at every time step

00:37:09.446 --> 00:37:10.876
when it's reading the sequence.

00:37:10.876 --> 00:37:12.006
So you can see that a lot

00:37:12.006 --> 00:37:13.958
of the vectors in these hidden states

00:37:13.958 --> 00:37:15.883
are kind of not very interpretable.

00:37:15.883 --> 00:37:17.404
It seems like they're
kind of doing some of this

00:37:17.404 --> 00:37:19.158
low level language modeling

00:37:19.158 --> 00:37:21.396
to figure out what
character should come next.

00:37:21.396 --> 00:37:23.438
But some of them end up quite nice.

00:37:23.438 --> 00:37:26.375
So here we found this vector
that is looking for quotes.

00:37:26.375 --> 00:37:28.507
You can see that there's
this one hidden element,

00:37:28.507 --> 00:37:30.318
this one element in the vector,

00:37:30.318 --> 00:37:32.486
that is off, off, off, off, off blue

00:37:32.486 --> 00:37:33.827
and then once it hits a quote,

00:37:33.827 --> 00:37:37.136
it turns on and remains
on for the duration

00:37:37.136 --> 00:37:38.303
of this quote.

00:37:39.187 --> 00:37:41.025
And now when we hit the
second quotation mark,

00:37:41.025 --> 00:37:42.598
then that cell turns off.

00:37:42.598 --> 00:37:44.958
So somehow, even though
this model was only trained

00:37:44.958 --> 00:37:46.606
to predict the next
character in a sequence,

00:37:46.606 --> 00:37:48.856
it somehow learned that a useful thing,

00:37:48.856 --> 00:37:49.976
in order to do this,

00:37:49.976 --> 00:37:54.222
might be to have some cell
that's trying to detect quotes.

00:37:54.222 --> 00:37:55.507
We also found this other cell

00:37:55.507 --> 00:37:58.956
that is, looks like it's
counting the number of characters

00:37:58.956 --> 00:38:00.104
since a line break.

00:38:00.104 --> 00:38:01.507
So you can see that at the
beginning of each line,

00:38:01.507 --> 00:38:04.176
this element starts off at zero.

00:38:04.176 --> 00:38:05.536
Throughout the course of the line,

00:38:05.536 --> 00:38:07.055
it's gradually more red,

00:38:07.055 --> 00:38:08.486
so that value increases.

00:38:08.486 --> 00:38:10.078
And then after the new line character,

00:38:10.078 --> 00:38:11.976
it resets to zero.

00:38:11.976 --> 00:38:13.356
So you can imagine that maybe this cell

00:38:13.356 --> 00:38:15.336
is letting the network keep track

00:38:15.336 --> 00:38:16.966
of when it needs to write

00:38:16.966 --> 00:38:19.545
to produce these new line characters.

00:38:19.545 --> 00:38:20.936
We also found some that,

00:38:20.936 --> 00:38:22.987
when we trained on the linux source code,

00:38:22.987 --> 00:38:25.307
we found some examples that are turning on

00:38:25.307 --> 00:38:27.158
inside the conditions of if statements.

00:38:27.158 --> 00:38:29.216
So this maybe allows the network

00:38:29.216 --> 00:38:32.364
to differentiate whether
it's outside an if statement

00:38:32.364 --> 00:38:33.838
or inside that condition,

00:38:33.838 --> 00:38:35.806
which might help it model
these sequences better.

00:38:35.806 --> 00:38:38.976
We also found some that
turn on in comments,

00:38:38.976 --> 00:38:41.467
or some that seem like they're counting

00:38:41.467 --> 00:38:44.765
the number of indentation levels.

00:38:44.765 --> 00:38:46.298
This is all just really cool stuff

00:38:46.298 --> 00:38:47.488
because it's saying that

00:38:47.488 --> 00:38:49.630
even though we are only
trying to train this model

00:38:49.630 --> 00:38:50.758
to predict next characters,

00:38:50.758 --> 00:38:52.229
it somehow ends up learning a lot

00:38:52.229 --> 00:38:55.646
of useful structure about the input data.

00:38:57.161 --> 00:38:59.307
One kind of thing that we often use,

00:38:59.307 --> 00:39:01.718
so this is not really been
computer vision so far,

00:39:01.718 --> 00:39:03.878
and we need to pull this
back to computer vision

00:39:03.878 --> 00:39:05.528
since this is a vision class.

00:39:05.528 --> 00:39:07.107
We've alluded many times to this

00:39:07.107 --> 00:39:08.747
image captioning model

00:39:08.747 --> 00:39:10.528
where we want to build
models that can input

00:39:10.528 --> 00:39:14.376
an image and then output a
caption in natural language.

00:39:14.376 --> 00:39:17.408
There were a bunch of
papers a couple years ago

00:39:17.408 --> 00:39:19.787
that all had relatively
similar approaches.

00:39:19.787 --> 00:39:22.776
But I'm showing the figure
from the paper from our lab

00:39:22.776 --> 00:39:25.026
in a totally un-biased way.

00:39:26.876 --> 00:39:29.198
But, the idea here is that the caption

00:39:29.198 --> 00:39:31.648
is this variably length
sequence that we might,

00:39:31.648 --> 00:39:34.198
the sequence might have different numbers

00:39:34.198 --> 00:39:35.608
of words for different captions.

00:39:35.608 --> 00:39:37.059
So this is a totally natural fit

00:39:37.059 --> 00:39:39.947
for a recurrent neural
network language model.

00:39:39.947 --> 00:39:41.598
So then what this model looks like

00:39:41.598 --> 00:39:43.878
is we have some convolutional network

00:39:43.878 --> 00:39:44.958
which will input the,

00:39:44.958 --> 00:39:46.786
which will take as input the image,

00:39:46.786 --> 00:39:48.397
and we've seen a lot about how

00:39:48.397 --> 00:39:50.379
convolution networks work at this point,

00:39:50.379 --> 00:39:52.753
and that convolutional
network will produce

00:39:52.753 --> 00:39:54.569
a summary vector of the image

00:39:54.569 --> 00:39:56.227
which will then feed
into the first time step

00:39:56.227 --> 00:39:58.928
of one of these recurrent
neural network language models

00:39:58.928 --> 00:40:02.888
which will then produce words
of the caption one at a time.

00:40:02.888 --> 00:40:04.945
So the way that this kind
of works at test time

00:40:04.945 --> 00:40:06.425
after the model is trained

00:40:06.425 --> 00:40:07.847
looks almost exactly the same

00:40:07.847 --> 00:40:09.665
as these character level language models

00:40:09.665 --> 00:40:11.178
that we saw a little bit ago.

00:40:11.178 --> 00:40:12.699
We'll take our input image,

00:40:12.699 --> 00:40:14.499
feed it through our convolutional network.

00:40:14.499 --> 00:40:16.987
But now instead of
taking the softmax scores

00:40:16.987 --> 00:40:18.638
from an image net model,

00:40:18.638 --> 00:40:22.997
we'll instead take this
4,096 dimensional vector

00:40:22.997 --> 00:40:24.915
from the end of the model,

00:40:24.915 --> 00:40:27.998
and we'll take that vector and use it

00:40:29.038 --> 00:40:31.377
to summarize the whole
content of the image.

00:40:31.377 --> 00:40:34.208
Now, remember when we talked
about RNN language models,

00:40:34.208 --> 00:40:36.379
we said that we need to
see the language model

00:40:36.379 --> 00:40:37.947
with that first initial input

00:40:37.947 --> 00:40:39.528
to tell it to start generating text.

00:40:39.528 --> 00:40:42.118
So in this case, we'll give
it some special start token,

00:40:42.118 --> 00:40:45.236
which is just saying, hey, this
is the start of a sentence.

00:40:45.236 --> 00:40:46.728
Please start generating some text

00:40:46.728 --> 00:40:49.006
conditioned on this image information.

00:40:49.006 --> 00:40:52.338
So now previously, we saw that
in this RNN language model,

00:40:52.338 --> 00:40:55.328
we had these matrices that
were taking the previous,

00:40:55.328 --> 00:40:57.107
the input at the current time step

00:40:57.107 --> 00:40:58.558
and the hidden state of
the previous time step

00:40:58.558 --> 00:41:01.240
and combining those to
get the next hidden state.

00:41:01.240 --> 00:41:04.678
Well now, we also need to add
in this image information.

00:41:04.678 --> 00:41:06.667
So one way, people play
around with exactly

00:41:06.667 --> 00:41:09.288
different ways to incorporate
this image information,

00:41:09.288 --> 00:41:10.688
but one simple way is just to add

00:41:10.688 --> 00:41:13.105
a third weight matrix that is

00:41:14.561 --> 00:41:16.779
adding in this image
information at every time step

00:41:16.779 --> 00:41:18.598
to compute the next hidden state.

00:41:18.598 --> 00:41:21.145
So now, we'll compute this distribution

00:41:21.145 --> 00:41:23.635
over all scores in our vocabulary

00:41:23.635 --> 00:41:25.056
and here, our vocabulary is something

00:41:25.056 --> 00:41:25.907
like all English words,

00:41:25.907 --> 00:41:27.307
so it could be pretty large.

00:41:27.307 --> 00:41:28.699
We'll sample from that distribution

00:41:28.699 --> 00:41:32.099
and now pass that word back as
input at the next time step.

00:41:32.099 --> 00:41:34.488
And that will then feed that word in,

00:41:34.488 --> 00:41:37.418
again get a distribution
over all words in the vocab,

00:41:37.418 --> 00:41:39.546
and again sample to produce the next word.

00:41:39.546 --> 00:41:42.248
So then, after that thing is all done,

00:41:42.248 --> 00:41:44.477
we'll maybe generate,

00:41:44.477 --> 00:41:47.538
we'll generate this complete sentence.

00:41:47.538 --> 00:41:50.956
We stop generation once we
sample the special ends token,

00:41:50.956 --> 00:41:52.887
which kind of corresponds to the period

00:41:52.887 --> 00:41:54.347
at the end of the sentence.

00:41:54.347 --> 00:41:56.419
Then once the network
samples this ends token,

00:41:56.419 --> 00:41:57.859
we stop generation and we're done

00:41:57.859 --> 00:42:00.328
and we've gotten our
caption for this image.

00:42:00.328 --> 00:42:01.798
And now, during training,

00:42:01.798 --> 00:42:03.419
we trained this thing to generate,

00:42:03.419 --> 00:42:04.939
like we put an end token at the end

00:42:04.939 --> 00:42:06.739
of every caption during training

00:42:06.739 --> 00:42:08.408
so that the network kind
of learned during training

00:42:08.408 --> 00:42:10.907
that end tokens come at
the end of sequences.

00:42:10.907 --> 00:42:12.958
So then, during test time,

00:42:12.958 --> 00:42:14.739
it tends to sample these end tokens

00:42:14.739 --> 00:42:16.848
once it's done generating.

00:42:16.848 --> 00:42:18.457
So we trained this model

00:42:18.457 --> 00:42:20.139
in kind of a completely supervised way.

00:42:20.139 --> 00:42:21.547
You can find data sets

00:42:21.547 --> 00:42:24.763
that have images together with
natural language captions.

00:42:24.763 --> 00:42:26.805
Microsoft COCO is probably the biggest

00:42:26.805 --> 00:42:28.666
and most widely used for this task.

00:42:28.666 --> 00:42:30.191
But you can just train this model

00:42:30.191 --> 00:42:31.883
in a purely supervised way.

00:42:31.883 --> 00:42:34.935
And then backpropagate
through to jointly train

00:42:34.935 --> 00:42:37.653
both this recurrent neural
network language model

00:42:37.653 --> 00:42:38.635
and then also pass gradients

00:42:38.635 --> 00:42:40.846
back into this final layer of this the CNN

00:42:40.846 --> 00:42:42.396
and additionally update the weights

00:42:42.396 --> 00:42:45.580
of the CNN to jointly tune
all parts of the model

00:42:45.580 --> 00:42:47.837
to perform this task.

00:42:47.837 --> 00:42:49.528
Once you train these models,

00:42:49.528 --> 00:42:52.438
they actually do some
pretty reasonable things.

00:42:52.438 --> 00:42:54.997
These are some real results from a model,

00:42:54.997 --> 00:42:56.689
from one of these trained models,

00:42:56.689 --> 00:42:58.226
and it says things like a cat sitting

00:42:58.226 --> 00:43:00.995
on a suitcase on the floor,

00:43:00.995 --> 00:43:02.583
which is pretty impressive.

00:43:02.583 --> 00:43:04.975
It knows about cats
sitting on a tree branch,

00:43:04.975 --> 00:43:06.499
which is also pretty cool.

00:43:06.499 --> 00:43:07.796
It knows about two people walking

00:43:07.796 --> 00:43:09.631
on the beach with surfboards.

00:43:09.631 --> 00:43:11.755
So these models are
actually pretty powerful

00:43:11.755 --> 00:43:14.642
and can produce relatively
complex captions

00:43:14.642 --> 00:43:16.635
to describe the image.

00:43:16.635 --> 00:43:17.630
But that being said,

00:43:17.630 --> 00:43:19.808
these models are really not perfect.

00:43:19.808 --> 00:43:21.227
They're not magical.

00:43:21.227 --> 00:43:24.319
Just like any machine learning model,

00:43:24.319 --> 00:43:25.629
if you try to run them on data

00:43:25.629 --> 00:43:27.458
that was very different
from the training data,

00:43:27.458 --> 00:43:28.830
they don't work very well.

00:43:28.830 --> 00:43:30.364
So for example, this example,

00:43:30.364 --> 00:43:33.317
it says a woman is
holding a cat in her hand.

00:43:33.317 --> 00:43:35.421
There's clearly no cat in the image.

00:43:35.421 --> 00:43:36.863
But she is wearing a fur coat,

00:43:36.863 --> 00:43:38.106
and maybe the texture of that coat

00:43:38.106 --> 00:43:41.359
kind of looked like a cat to the model.

00:43:41.359 --> 00:43:42.903
Over here, we see a
woman standing on a beach

00:43:42.903 --> 00:43:44.379
holding a surfboard.

00:43:44.379 --> 00:43:46.744
Well, she's definitely
not holding a surfboard

00:43:46.744 --> 00:43:47.892
and she's doing a handstand,

00:43:47.892 --> 00:43:49.909
which is maybe the interesting
part of that image,

00:43:49.909 --> 00:43:51.820
and the model totally missed that.

00:43:51.820 --> 00:43:53.623
Also, over here, we see this example

00:43:53.623 --> 00:43:56.137
where there's this picture of a spider web

00:43:56.137 --> 00:43:57.168
in the tree branch,

00:43:57.168 --> 00:43:58.238
and it totally,

00:43:58.238 --> 00:43:59.071
and it says something like

00:43:59.071 --> 00:44:00.382
a bird sitting on a tree branch.

00:44:00.382 --> 00:44:01.802
So it totally missed the spider,

00:44:01.802 --> 00:44:03.144
but during training,

00:44:03.144 --> 00:44:05.139
it never really saw examples of spiders.

00:44:05.139 --> 00:44:06.722
It just knows that birds sit

00:44:06.722 --> 00:44:08.217
on tree branches during training.

00:44:08.217 --> 00:44:10.152
So it kind of makes these
reasonable mistakes.

00:44:10.152 --> 00:44:10.985
Or here at the bottom,

00:44:10.985 --> 00:44:12.155
it can't really tell the difference

00:44:12.155 --> 00:44:14.338
between this guy throwing
and catching the ball,

00:44:14.338 --> 00:44:15.712
but it does know that
it's a baseball player

00:44:15.712 --> 00:44:17.709
and there's balls and things involved.

00:44:17.709 --> 00:44:19.347
So again, just want to
say that these models

00:44:19.347 --> 00:44:20.441
are not perfect.

00:44:20.441 --> 00:44:23.313
They work pretty well when
you ask them to caption images

00:44:23.313 --> 00:44:25.109
that were similar to the training data,

00:44:25.109 --> 00:44:26.711
but they definitely have a hard time

00:44:26.711 --> 00:44:29.188
generalizing far beyond that.

00:44:29.188 --> 00:44:30.560
So another thing you'll sometimes see

00:44:30.560 --> 00:44:34.152
is this slightly more advanced
model called Attention,

00:44:34.152 --> 00:44:36.447
where now when we're generating
the words of this caption,

00:44:36.447 --> 00:44:39.120
we can allow the model
to steer it's attention

00:44:39.120 --> 00:44:41.036
to different parts of the image.

00:44:41.036 --> 00:44:44.670
And I don't want to spend
too much time on this.

00:44:44.670 --> 00:44:46.859
But the general way
that this works is that

00:44:46.859 --> 00:44:48.925
now our convolutional network,

00:44:48.925 --> 00:44:50.664
rather than producing a single vector

00:44:50.664 --> 00:44:52.227
summarizing the entire image,

00:44:52.227 --> 00:44:54.274
now it produces some grid of vectors

00:44:54.274 --> 00:44:55.683
that summarize the,

00:44:55.683 --> 00:44:58.503
that give maybe one vector
for each spatial location

00:44:58.503 --> 00:44:59.954
in the image.

00:44:59.954 --> 00:45:00.787
And now, when we,

00:45:00.787 --> 00:45:03.304
when this model runs forward,

00:45:03.304 --> 00:45:06.618
in addition to sampling the
vocabulary at every time step,

00:45:06.618 --> 00:45:08.523
it also produces a distribution

00:45:08.523 --> 00:45:09.964
over the locations in the image

00:45:09.964 --> 00:45:11.263
where it wants to look.

00:45:11.263 --> 00:45:13.581
And now this distribution
over image locations

00:45:13.581 --> 00:45:15.561
can be seen as a kind of a tension

00:45:15.561 --> 00:45:18.276
of where the model should
look during training.

00:45:18.276 --> 00:45:19.397
So now that first hidden state

00:45:19.397 --> 00:45:23.155
computes this distribution
over image locations,

00:45:23.155 --> 00:45:26.316
which then goes back to the set of vectors

00:45:26.316 --> 00:45:27.832
to give a single summary vector

00:45:27.832 --> 00:45:31.372
that maybe focuses the attention
on one part of that image.

00:45:31.372 --> 00:45:32.795
And now that summary vector gets fed,

00:45:32.795 --> 00:45:34.422
as an additional input,

00:45:34.422 --> 00:45:36.508
at the next time step
of the neural network.

00:45:36.508 --> 00:45:38.278
And now again, it will
produce two outputs.

00:45:38.278 --> 00:45:40.414
One is our distribution
over vocabulary words.

00:45:40.414 --> 00:45:43.714
And the other is a distribution
over image locations.

00:45:43.714 --> 00:45:45.402
This whole process will continue,

00:45:45.402 --> 00:45:47.577
and it will sort of do
these two different things

00:45:47.577 --> 00:45:49.160
at every time step.

00:45:50.296 --> 00:45:51.390
And after you train the model,

00:45:51.390 --> 00:45:53.905
then you can see that it kind of

00:45:53.905 --> 00:45:56.104
will shift it's attention around the image

00:45:56.104 --> 00:45:58.298
for every word that it
generates in the caption.

00:45:58.298 --> 00:45:59.827
Here you can see that it

00:45:59.827 --> 00:46:02.723
produced the caption,
a bird is flying over,

00:46:02.723 --> 00:46:04.436
I can't see that far.

00:46:04.436 --> 00:46:06.474
But you can see that its attention

00:46:06.474 --> 00:46:08.556
is shifting around
different parts of the image

00:46:08.556 --> 00:46:12.473
for each word in the
caption that it generates.

00:46:14.112 --> 00:46:15.417
There's this notion of hard attention

00:46:15.417 --> 00:46:16.544
versus soft attention,

00:46:16.544 --> 00:46:19.147
which I don't really want
to get into too much,

00:46:19.147 --> 00:46:21.434
but with this idea of soft attention,

00:46:21.434 --> 00:46:24.405
we're kind of taking
a weighted combination

00:46:24.405 --> 00:46:26.614
of all features from all image locations,

00:46:26.614 --> 00:46:28.275
whereas in the hard attention case,

00:46:28.275 --> 00:46:31.004
we're forcing the model to
select exactly one location

00:46:31.004 --> 00:46:34.259
to look at in the image at each time step.

00:46:34.259 --> 00:46:35.123
So the hard attention case

00:46:35.123 --> 00:46:36.946
where we're selecting
exactly one image location

00:46:36.946 --> 00:46:38.277
is a little bit tricky

00:46:38.277 --> 00:46:41.270
because that is not really
a differentiable function,

00:46:41.270 --> 00:46:42.774
so you need to do
something slightly fancier

00:46:42.774 --> 00:46:44.292
than vanilla backpropagation

00:46:44.292 --> 00:46:46.247
in order to just train the
model in that scenario.

00:46:46.247 --> 00:46:48.145
And I think we'll talk about
that a little bit later

00:46:48.145 --> 00:46:51.498
in the lecture on reinforcement learning.

00:46:51.498 --> 00:46:53.577
Now, when you look at after you train

00:46:53.577 --> 00:46:54.750
one of these attention models

00:46:54.750 --> 00:46:57.415
and then run it on to generate captions,

00:46:57.415 --> 00:46:59.326
you can see that it tends
to focus it's attention

00:46:59.326 --> 00:47:01.980
on maybe the salient or
semanticly meaningful part

00:47:01.980 --> 00:47:04.067
of the image when generating captions.

00:47:04.067 --> 00:47:06.419
You can see that the caption was a woman

00:47:06.419 --> 00:47:08.413
is throwing a frisbee in a park

00:47:08.413 --> 00:47:10.257
and you can see that this attention mask,

00:47:10.257 --> 00:47:11.637
when it generated the word,

00:47:11.637 --> 00:47:13.648
when the model generated the word frisbee,

00:47:13.648 --> 00:47:14.516
at the same time,

00:47:14.516 --> 00:47:17.403
it was focusing it's
attention on this image region

00:47:17.403 --> 00:47:18.976
that actually contains the frisbee.

00:47:18.976 --> 00:47:20.644
This is actually really cool.

00:47:20.644 --> 00:47:23.410
We did not tell the model
where it should be looking

00:47:23.410 --> 00:47:24.542
at every time step.

00:47:24.542 --> 00:47:26.552
It sort of figured all that out for itself

00:47:26.552 --> 00:47:27.955
during the training process.

00:47:27.955 --> 00:47:29.716
Because somehow, it
figured out that looking

00:47:29.716 --> 00:47:31.540
at that image region was
the right thing to do

00:47:31.540 --> 00:47:32.721
for this image.

00:47:32.721 --> 00:47:34.610
And because everything in
this model is differentiable,

00:47:34.610 --> 00:47:35.998
because we can backpropagate

00:47:35.998 --> 00:47:38.161
through all these soft attention steps,

00:47:38.161 --> 00:47:39.592
all of this soft attention stuff

00:47:39.592 --> 00:47:42.541
just comes out through
the training process.

00:47:42.541 --> 00:47:45.297
So that's really, really cool.

00:47:45.297 --> 00:47:47.345
By the way, this idea of
recurrent neural networks

00:47:47.345 --> 00:47:49.975
and attention actually
gets used in other tasks

00:47:49.975 --> 00:47:51.565
beyond image captioning.

00:47:51.565 --> 00:47:53.169
One recent example is this idea

00:47:53.169 --> 00:47:54.691
of visual question answering.

00:47:54.691 --> 00:47:58.195
So here, our model is going
to take two things as input.

00:47:58.195 --> 00:47:59.259
It's going to take an image

00:47:59.259 --> 00:48:01.608
and it will also take a
natural language question

00:48:01.608 --> 00:48:04.010
that's asking some
question about the image.

00:48:04.010 --> 00:48:06.122
Here, we might see this image on the left

00:48:06.122 --> 00:48:07.536
and we might ask the question,

00:48:07.536 --> 00:48:10.163
what endangered animal
is featured on the truck?

00:48:10.163 --> 00:48:11.915
And now the model needs to select from one

00:48:11.915 --> 00:48:13.835
of these four natural language answers

00:48:13.835 --> 00:48:17.327
about which of these answers
correctly answers that question

00:48:17.327 --> 00:48:19.027
in the context of the image.

00:48:19.027 --> 00:48:21.586
So you can imagine kind of
stitching this model together

00:48:21.586 --> 00:48:24.774
using CNNs and RNNs in
kind of a natural way.

00:48:24.774 --> 00:48:26.274
Now, we're in this

00:48:28.125 --> 00:48:29.701
many to one scenario,

00:48:29.701 --> 00:48:31.978
where now our model needs to take as input

00:48:31.978 --> 00:48:33.566
this natural language sequence,

00:48:33.566 --> 00:48:35.552
so we can imagine running
a recurrent neural network

00:48:35.552 --> 00:48:37.485
over each element of that input question,

00:48:37.485 --> 00:48:40.111
to now summarize the input
question in a single vector.

00:48:40.111 --> 00:48:43.928
And then we can have a CNN
to again summarize the image,

00:48:43.928 --> 00:48:46.012
and now combine both
the vector from the CNN

00:48:46.012 --> 00:48:49.531
and the vector from the
question and coding RNN

00:48:49.531 --> 00:48:51.645
to then predict a
distribution over answers.

00:48:51.645 --> 00:48:52.786
We also sometimes,

00:48:52.786 --> 00:48:54.416
you'll also sometimes see this idea

00:48:54.416 --> 00:48:56.275
of soft spacial attention
being incorporated

00:48:56.275 --> 00:48:59.175
into things like visual
question answering.

00:48:59.175 --> 00:49:00.394
So you can see that here,

00:49:00.394 --> 00:49:03.956
this model is also having
the spatial attention

00:49:03.956 --> 00:49:05.342
over the image when it's trying

00:49:05.342 --> 00:49:07.062
to determine answers to the questions.

00:49:07.062 --> 00:49:09.062
Just to, yeah, question?

00:49:18.715 --> 00:49:19.548
So the question is

00:49:19.548 --> 00:49:20.878
How are the different inputs combined?

00:49:20.878 --> 00:49:23.116
Do you mean like the
encoded question vector

00:49:23.116 --> 00:49:25.998
and the encoded image vector?

00:49:25.998 --> 00:49:27.188
Yeah, so the question is

00:49:27.188 --> 00:49:28.445
how are the encoded image

00:49:28.445 --> 00:49:30.395
and the encoded question vector combined?

00:49:30.395 --> 00:49:31.756
Kind of the simplest thing to do

00:49:31.756 --> 00:49:32.885
is just to concatenate them

00:49:32.885 --> 00:49:34.847
and stick them into
fully connected layers.

00:49:34.847 --> 00:49:35.883
That's probably the most common

00:49:35.883 --> 00:49:37.864
and that's probably
the first thing to try.

00:49:37.864 --> 00:49:39.396
Sometimes people do
slightly fancier things

00:49:39.396 --> 00:49:41.673
where they might try to have
multiplicative interactions

00:49:41.673 --> 00:49:42.885
between those two vectors

00:49:42.885 --> 00:49:44.389
to allow a more powerful function.

00:49:44.389 --> 00:49:46.467
But generally, concatenation
is kind of a good

00:49:46.467 --> 00:49:48.050
first thing to try.

00:49:49.426 --> 00:49:51.727
Okay, so now we've talked
about a bunch of scenarios

00:49:51.727 --> 00:49:54.083
where RNNs are used for
different kinds of problems.

00:49:54.083 --> 00:49:55.084
And I think it's super cool

00:49:55.084 --> 00:49:56.758
because it allows you to

00:49:56.758 --> 00:49:58.855
start tackling really complicated problems

00:49:58.855 --> 00:50:02.386
combining images and computer vision

00:50:02.386 --> 00:50:04.072
with natural language processing.

00:50:04.072 --> 00:50:05.794
And you can see that we
can kind of stith together

00:50:05.794 --> 00:50:06.776
these models like Lego blocks

00:50:06.776 --> 00:50:08.870
and attack really complicated things,

00:50:08.870 --> 00:50:11.369
Like image captioning or
visual question answering

00:50:11.369 --> 00:50:14.069
just by stitching together
these relatively simple

00:50:14.069 --> 00:50:16.736
types of neural network modules.

00:50:18.708 --> 00:50:20.861
But I'd also like to mention that

00:50:20.861 --> 00:50:22.359
so far, we've talked about this idea

00:50:22.359 --> 00:50:24.060
of a single recurrent network layer,

00:50:24.060 --> 00:50:26.274
where we have sort of one hidden state,

00:50:26.274 --> 00:50:29.404
and another thing that
you'll see pretty commonly

00:50:29.404 --> 00:50:32.581
is this idea of a multilayer
recurrent neural network.

00:50:32.581 --> 00:50:36.577
Here, this is a three layer
recurrent neural network,

00:50:36.577 --> 00:50:38.535
so now our input goes in,

00:50:38.535 --> 00:50:42.292
goes into, goes in and produces
a sequence of hidden states

00:50:42.292 --> 00:50:44.554
from the first recurrent
neural network layer.

00:50:44.554 --> 00:50:46.309
And now, after we run kind of

00:50:46.309 --> 00:50:47.893
one recurrent neural network layer,

00:50:47.893 --> 00:50:50.468
then we have this whole
sequence of hidden states.

00:50:50.468 --> 00:50:52.991
And now, we can use the
sequence of hidden states

00:50:52.991 --> 00:50:54.733
as an input sequence to another

00:50:54.733 --> 00:50:56.474
recurrent neural network layer.

00:50:56.474 --> 00:50:57.801
And then you can just imagine,

00:50:57.801 --> 00:50:59.757
which will then produce another
sequence of hidden states

00:50:59.757 --> 00:51:01.577
from the second RNN layer.

00:51:01.577 --> 00:51:02.410
And then you can just imagine

00:51:02.410 --> 00:51:04.181
stacking these things
on top of each other,

00:51:04.181 --> 00:51:06.108
cause we know that we've
seen in other contexts

00:51:06.108 --> 00:51:08.253
that deeper models tend to perform better

00:51:08.253 --> 00:51:09.398
for various problems.

00:51:09.398 --> 00:51:11.851
And the same kind of
holds in RNNs as well.

00:51:11.851 --> 00:51:14.377
For many problems, you'll see maybe a two

00:51:14.377 --> 00:51:16.714
or three layer recurrent
neural network model

00:51:16.714 --> 00:51:18.215
is pretty commonly used.

00:51:18.215 --> 00:51:22.086
You typically don't see
super deep models in RNNs.

00:51:22.086 --> 00:51:25.449
So generally, like two,
three, four layer RNNs

00:51:25.449 --> 00:51:29.327
is maybe as deep as you'll typically go.

00:51:29.327 --> 00:51:32.231
Then, I think it's also really
interesting and important

00:51:32.231 --> 00:51:33.500
to think about,

00:51:33.500 --> 00:51:34.714
now we've seen kind of

00:51:34.714 --> 00:51:38.222
what kinds of problems
these RNNs can be used for,

00:51:38.222 --> 00:51:40.079
but then you need to think
a little bit more carefully

00:51:40.079 --> 00:51:41.795
about exactly what happens to these models

00:51:41.795 --> 00:51:43.229
when we try to train them.

00:51:43.229 --> 00:51:45.704
So here, I've drawn this
little vanilla RNN cell

00:51:45.704 --> 00:51:47.447
that we've talked about so far.

00:51:47.447 --> 00:51:50.639
So here, we're taking
our current input, x t,

00:51:50.639 --> 00:51:52.971
and our previous hidden
state, h t minus one,

00:51:52.971 --> 00:51:55.307
and then we stack, those are two vectors.

00:51:55.307 --> 00:51:57.359
So we can just stack them together.

00:51:57.359 --> 00:51:59.033
And then perform this
matrix multiplication

00:51:59.033 --> 00:52:00.431
with our weight matrix,

00:52:00.431 --> 00:52:01.911
to give our,

00:52:01.911 --> 00:52:03.798
and then squash that
output through a tanh,

00:52:03.798 --> 00:52:05.783
and that will give us
our next hidden state.

00:52:05.783 --> 00:52:07.687
And that's kind of the
basic functional form

00:52:07.687 --> 00:52:10.131
of this vanilla recurrent neural network.

00:52:10.131 --> 00:52:11.347
But then, we need to think about

00:52:11.347 --> 00:52:14.080
what happens in this architecture

00:52:14.080 --> 00:52:17.738
during the backward pass when
we try to compute gradients?

00:52:17.738 --> 00:52:19.435
So then if we think
about trying to compute,

00:52:19.435 --> 00:52:21.426
so then during the backwards pass,

00:52:21.426 --> 00:52:24.759
we'll receive the derivative of our h t,

00:52:25.850 --> 00:52:27.075
we'll receive derivative of loss

00:52:27.075 --> 00:52:28.568
with respect to h t.

00:52:28.568 --> 00:52:30.885
And during the backward
pass through the cell,

00:52:30.885 --> 00:52:32.720
we'll need to compute derivative of loss

00:52:32.720 --> 00:52:34.632
to the respect of h t minus one.

00:52:34.632 --> 00:52:37.026
Then, when we compute this backward pass,

00:52:37.026 --> 00:52:38.601
we see that the gradient flows backward

00:52:38.601 --> 00:52:39.881
through this red path.

00:52:39.881 --> 00:52:41.706
So first, that gradient
will flow backwards

00:52:41.706 --> 00:52:43.068
through this tanh gate,

00:52:43.068 --> 00:52:44.036
and then it will flow backwards

00:52:44.036 --> 00:52:45.958
through this matrix multiplication gate.

00:52:45.958 --> 00:52:48.644
And then, as we've seen in the homework

00:52:48.644 --> 00:52:52.054
and when implementing these
matrix multiplication layers,

00:52:52.054 --> 00:52:53.252
when you backpropagate through

00:52:53.252 --> 00:52:54.711
this matrix multiplication gate,

00:52:54.711 --> 00:52:56.785
you end up mulitplying by the transpose

00:52:56.785 --> 00:52:58.457
of that weight matrix.

00:52:58.457 --> 00:53:00.642
So that means that every
time we backpropagate

00:53:00.642 --> 00:53:03.857
through one of these vanilla RNN cells,

00:53:03.857 --> 00:53:08.024
we end up multiplying by some
part of the weight matrix.

00:53:09.110 --> 00:53:11.124
So now if you imagine
that we are sticking many

00:53:11.124 --> 00:53:13.834
of these recurrent neural
network cells in sequence,

00:53:13.834 --> 00:53:15.413
because again this is an RNN.

00:53:15.413 --> 00:53:16.599
We want a model sequences.

00:53:16.599 --> 00:53:18.976
Now if you imagine what
happens to the gradient flow

00:53:18.976 --> 00:53:20.921
through a sequence of these layers,

00:53:20.921 --> 00:53:23.770
then something kind of
fishy starts to happen.

00:53:23.770 --> 00:53:25.207
Because now, when we want to compute

00:53:25.207 --> 00:53:27.863
the gradient of the loss
with respect to h zero,

00:53:27.863 --> 00:53:29.440
we need to backpropagate through every one

00:53:29.440 --> 00:53:30.810
of these RNN cells.

00:53:30.810 --> 00:53:32.903
And every time you
backpropagate through one cell,

00:53:32.903 --> 00:53:35.641
you'll pick up one of
these w transpose factors.

00:53:35.641 --> 00:53:38.176
So that means that the final expression

00:53:38.176 --> 00:53:40.289
for the gradient on h zero

00:53:40.289 --> 00:53:42.161
will involve many, many factors

00:53:42.161 --> 00:53:43.550
of this weight matrix,

00:53:43.550 --> 00:53:44.967
which could be kind of bad.

00:53:44.967 --> 00:53:46.976
Maybe don't think about the weight,

00:53:46.976 --> 00:53:48.329
the matrix case,

00:53:48.329 --> 00:53:50.138
but imagine a scaler case.

00:53:50.138 --> 00:53:51.877
If we end up, if we have some scaler

00:53:51.877 --> 00:53:53.662
and we multiply by that
same number over and over

00:53:53.662 --> 00:53:54.563
and over again,

00:53:54.563 --> 00:53:56.038
maybe not for four examples,

00:53:56.038 --> 00:53:57.133
but for something like a hundred

00:53:57.133 --> 00:54:00.269
or several hundred time steps,

00:54:00.269 --> 00:54:01.832
then multiplying by the same number

00:54:01.832 --> 00:54:03.835
over and over again is really bad.

00:54:03.835 --> 00:54:05.244
In the scaler case,

00:54:05.244 --> 00:54:06.756
it's either going to explode

00:54:06.756 --> 00:54:08.971
in the case that that
number is greater than one

00:54:08.971 --> 00:54:10.504
or it's going to vanish towards zero

00:54:10.504 --> 00:54:12.912
in the case that number is less than one

00:54:12.912 --> 00:54:14.069
in absolute value.

00:54:14.069 --> 00:54:16.355
And the only way in which
this will not happen

00:54:16.355 --> 00:54:18.186
is if that number is exactly one,

00:54:18.186 --> 00:54:20.911
which is actually very
rare to happen in practice.

00:54:20.911 --> 00:54:22.569
That leaves us to,

00:54:22.569 --> 00:54:25.229
that same intuition
extends to the matrix case,

00:54:25.229 --> 00:54:27.560
but now, rather than the absolute
value of a scaler number,

00:54:27.560 --> 00:54:29.261
you instead need to look at the largest,

00:54:29.261 --> 00:54:32.071
the largest singular value
of this weight matrix.

00:54:32.071 --> 00:54:34.707
Now if that largest singular
value is greater than one,

00:54:34.707 --> 00:54:36.502
then during this backward pass,

00:54:36.502 --> 00:54:38.824
when we multiply by the
weight matrix over and over,

00:54:38.824 --> 00:54:42.135
that gradient on h w, on h zero, sorry,

00:54:42.135 --> 00:54:43.915
will become very, very large,

00:54:43.915 --> 00:54:46.079
when that matrix is too large.

00:54:46.079 --> 00:54:48.947
And that's something we call
the exploding gradient problem.

00:54:48.947 --> 00:54:52.047
Where now this gradient will
explode exponentially in depth

00:54:52.047 --> 00:54:53.427
with the number of time steps

00:54:53.427 --> 00:54:55.061
that we backpropagate through.

00:54:55.061 --> 00:54:57.643
And if the largest singular
value is less than one,

00:54:57.643 --> 00:54:59.123
then we get the opposite problem,

00:54:59.123 --> 00:55:00.036
where now our gradients will shrink

00:55:00.036 --> 00:55:01.563
and shrink and shrink exponentially,

00:55:01.563 --> 00:55:03.945
as we backpropagate and pick
up more and more factors

00:55:03.945 --> 00:55:05.349
of this weight matrix.

00:55:05.349 --> 00:55:08.863
That's called the
vanishing gradient problem.

00:55:08.863 --> 00:55:11.172
THere's a bit of a hack
that people sometimes do

00:55:11.172 --> 00:55:12.836
to fix the exploding gradient problem

00:55:12.836 --> 00:55:14.208
called gradient clipping,

00:55:14.208 --> 00:55:16.264
which is just this simple heuristic

00:55:16.264 --> 00:55:18.660
saying that after we compute our gradient,

00:55:18.660 --> 00:55:19.713
if that gradient,

00:55:19.713 --> 00:55:22.156
if it's L2 norm is above some threshold,

00:55:22.156 --> 00:55:24.112
then just clamp it down and divide,

00:55:24.112 --> 00:55:27.955
just clamp it down so it
has this maximum threshold.

00:55:27.955 --> 00:55:29.271
This is kind of a nasty hack,

00:55:29.271 --> 00:55:30.964
but it actually gets used
in practice quite a lot

00:55:30.964 --> 00:55:32.645
when training recurrent neural networks.

00:55:32.645 --> 00:55:34.517
And it's a relatively useful tool

00:55:34.517 --> 00:55:39.034
for attacking this
exploding gradient problem.

00:55:39.034 --> 00:55:40.994
But now for the vanishing
gradient problem,

00:55:40.994 --> 00:55:42.254
what we typically do

00:55:42.254 --> 00:55:43.640
is we might need to move to

00:55:43.640 --> 00:55:46.207
a more complicated RNN architecture.

00:55:46.207 --> 00:55:48.939
So that motivates this idea of an LSTM.

00:55:48.939 --> 00:55:53.524
An LSTM, which stands for
Long Short Term Memory,

00:55:53.524 --> 00:55:55.700
is this slightly fancier
recurrence relation

00:55:55.700 --> 00:55:58.316
for these recurrent neural networks.

00:55:58.316 --> 00:55:59.984
It's really designed to help alleviate

00:55:59.984 --> 00:56:03.330
this problem of vanishing
and exploding gradients.

00:56:03.330 --> 00:56:05.591
So that rather than kind
of hacking on top of it,

00:56:05.591 --> 00:56:07.794
we just kind of design the architecture

00:56:07.794 --> 00:56:10.193
to have better gradient flow properties.

00:56:10.193 --> 00:56:13.566
Kind of an analogy to those
fancier CNN architectures

00:56:13.566 --> 00:56:16.556
that we saw at the top of the lecture.

00:56:16.556 --> 00:56:17.389
Another thing to point out

00:56:17.389 --> 00:56:20.807
is that the LSTM cell
actually comes from 1997.

00:56:20.807 --> 00:56:22.337
So this idea of an LSTM

00:56:22.337 --> 00:56:24.073
has been around for quite a while,

00:56:24.073 --> 00:56:26.108
and these folks were
working on these ideas

00:56:26.108 --> 00:56:27.110
way back in the 90s,

00:56:27.110 --> 00:56:28.852
were definitely ahead of the curve.

00:56:28.852 --> 00:56:31.225
Because these models are
kind of used everywhere now

00:56:31.225 --> 00:56:32.475
20 years later.

00:56:33.864 --> 00:56:37.921
And LSTMs kind of have
this funny functional form.

00:56:37.921 --> 00:56:39.814
So remember when we had this vanilla

00:56:39.814 --> 00:56:41.198
recurrent neural network,

00:56:41.198 --> 00:56:42.538
it had this hidden state.

00:56:42.538 --> 00:56:44.056
And we used this recurrence relation

00:56:44.056 --> 00:56:46.397
to update the hidden
state at every time step.

00:56:46.397 --> 00:56:47.570
Well, now in an LSTM,

00:56:47.570 --> 00:56:48.842
we actually have two,

00:56:48.842 --> 00:56:51.462
we maintain two hidden
states at every time step.

00:56:51.462 --> 00:56:52.931
One is this h t,

00:56:52.931 --> 00:56:54.439
which is called the hidden state,

00:56:54.439 --> 00:56:57.334
which is kind of an
analogy to the hidden state

00:56:57.334 --> 00:56:59.305
that we had in the vanilla RNN.

00:56:59.305 --> 00:57:02.524
But an LSTM also maintains
the second vector, c t,

00:57:02.524 --> 00:57:03.604
called the cell state.

00:57:03.604 --> 00:57:06.459
And the cell state is this
vector which is kind of internal,

00:57:06.459 --> 00:57:08.546
kept inside the LSTM,

00:57:08.546 --> 00:57:12.240
and it does not really get
exposed to the outside world.

00:57:12.240 --> 00:57:13.170
And we'll see,

00:57:13.170 --> 00:57:15.260
and you can kind of see that
through this update equation,

00:57:15.260 --> 00:57:17.371
where you can see that when we,

00:57:17.371 --> 00:57:19.235
first when we compute these,

00:57:19.235 --> 00:57:20.803
we take our two inputs,

00:57:20.803 --> 00:57:23.966
we use them to compute these four gates

00:57:23.966 --> 00:57:25.485
called i, f, o, n, g.

00:57:25.485 --> 00:57:28.634
We use those gates to
update our cell states, c t,

00:57:28.634 --> 00:57:30.302
and then we expose part of our cell state

00:57:30.302 --> 00:57:33.802
as the hidden state at the next time step.

00:57:36.704 --> 00:57:38.282
This is kind of a funny functional form,

00:57:38.282 --> 00:57:40.227
and I want to walk through
for a couple slides

00:57:40.227 --> 00:57:42.407
exactly why do we use this architecture

00:57:42.407 --> 00:57:44.014
and why does it make sense,

00:57:44.014 --> 00:57:45.418
especially in the context

00:57:45.418 --> 00:57:47.731
of vanishing or exploding gradients.

00:57:47.731 --> 00:57:51.044
This first thing that we do in an LSTM

00:57:51.044 --> 00:57:54.534
is that we're given this
previous hidden state, h t,

00:57:54.534 --> 00:57:57.213
and we're given our
current input vector, x t,

00:57:57.213 --> 00:57:58.611
and just like the vanilla RNN.

00:57:58.611 --> 00:58:00.251
In the vanilla RNN, remember,

00:58:00.251 --> 00:58:02.617
we took those two input vectors.

00:58:02.617 --> 00:58:03.908
We concatenated them.

00:58:03.908 --> 00:58:05.098
Then we did a matrix multiply

00:58:05.098 --> 00:58:08.663
to directly compute the next
hidden state in the RNN.

00:58:08.663 --> 00:58:10.751
Now, the LSTM does something
a little bit different.

00:58:10.751 --> 00:58:13.055
We're going to take our
previous hidden state

00:58:13.055 --> 00:58:14.429
and our current input,

00:58:14.429 --> 00:58:15.315
stack them,

00:58:15.315 --> 00:58:18.748
and now multiply by a
very big weight matrix, w,

00:58:18.748 --> 00:58:21.926
to compute four different gates,

00:58:21.926 --> 00:58:24.391
Which all have the same
size as the hidden state.

00:58:24.391 --> 00:58:26.193
Sometimes, you'll see this
written in different ways.

00:58:26.193 --> 00:58:29.549
Some authors will write
a different weight matrix

00:58:29.549 --> 00:58:30.808
for each gate.

00:58:30.808 --> 00:58:31.793
Some authors will combine them all

00:58:31.793 --> 00:58:33.160
into one big weight matrix.

00:58:33.160 --> 00:58:34.677
But it's all really the same thing.

00:58:34.677 --> 00:58:36.334
The ideas is that we
take our hidden state,

00:58:36.334 --> 00:58:37.282
our current input,

00:58:37.282 --> 00:58:40.288
and then we use those to
compute these four gates.

00:58:40.288 --> 00:58:42.163
These four gates are the,

00:58:42.163 --> 00:58:45.725
you often see this written
as i, f, o, g, ifog,

00:58:45.725 --> 00:58:47.768
which makes it pretty easy
to remember what they are.

00:58:47.768 --> 00:58:49.435
I is the input gate.

00:58:50.324 --> 00:58:53.655
It says how much do we want
to input into our cell.

00:58:53.655 --> 00:58:55.431
F is the forget gate.

00:58:55.431 --> 00:58:57.860
How much do we want to
forget the cell memory

00:58:57.860 --> 00:59:00.194
at the previous, from
the previous time step.

00:59:00.194 --> 00:59:01.333
O is the output gate,

00:59:01.333 --> 00:59:03.327
which is how much do we
want to reveal ourself

00:59:03.327 --> 00:59:04.653
to the outside world.

00:59:04.653 --> 00:59:07.092
And G really doesn't have a nice name,

00:59:07.092 --> 00:59:10.130
so I usually call it the gate gate.

00:59:10.130 --> 00:59:13.109
G, it tells us how much
do we want to write

00:59:13.109 --> 00:59:14.626
into our input cell.

00:59:14.626 --> 00:59:16.665
And then you notice that
each of these four gates

00:59:16.665 --> 00:59:19.665
are using a different non linearity.

00:59:21.724 --> 00:59:23.809
The input, forget and output gate

00:59:23.809 --> 00:59:25.047
are all using sigmoids,

00:59:25.047 --> 00:59:28.571
which means that their values
will be between zero and one.

00:59:28.571 --> 00:59:31.109
Whereas the gate gate uses a tanh,

00:59:31.109 --> 00:59:34.316
which means it's output will
be between minus one and one.

00:59:34.316 --> 00:59:36.810
So, these are kind of weird,

00:59:36.810 --> 00:59:38.551
but it makes a little bit more sense

00:59:38.551 --> 00:59:41.725
if you imagine them all as binary values.

00:59:41.725 --> 00:59:43.077
Right, like what happens at the extremes

00:59:43.077 --> 00:59:44.744
of these two values?

00:59:46.033 --> 00:59:46.866
It's kind of what happens,

00:59:46.866 --> 00:59:48.490
if you look after we compute these gates

00:59:48.490 --> 00:59:50.200
if you look at this next equation,

00:59:50.200 --> 00:59:51.423
you can see that our cell state

00:59:51.423 --> 00:59:53.926
is being multiplied element
wise by the forget gate.

00:59:53.926 --> 00:59:56.269
Sorry, our cell state from
the previous time step

00:59:56.269 --> 00:59:59.305
is being multiplied element
wise by this forget gate.

00:59:59.305 --> 01:00:00.350
And now if this forget gate,

01:00:00.350 --> 01:00:03.861
you can think of it as being
a vector of zeros and ones,

01:00:03.861 --> 01:00:06.162
that's telling us for each
element in the cell state,

01:00:06.162 --> 01:00:08.416
do we want to forget
that element of the cell

01:00:08.416 --> 01:00:10.644
in the case if the forget gate was zero?

01:00:10.644 --> 01:00:12.819
Or do we want to remember
that element of the cell

01:00:12.819 --> 01:00:14.935
in the case if the forget gate was one.

01:00:14.935 --> 01:00:17.258
Now, once we've used the forget gate

01:00:17.258 --> 01:00:19.767
to gate off the part of the cell state,

01:00:19.767 --> 01:00:21.088
then we have the second term,

01:00:21.088 --> 01:00:24.814
which is the element
wise product of i and g.

01:00:24.814 --> 01:00:27.431
So now, i is this vector
of zeros and ones,

01:00:27.431 --> 01:00:28.824
cause it's coming through a sigmoid,

01:00:28.824 --> 01:00:31.480
telling us for each
element of the cell state,

01:00:31.480 --> 01:00:33.909
do we want to write to that
element of the cell state

01:00:33.909 --> 01:00:35.728
in the case that i is one,

01:00:35.728 --> 01:00:37.771
or do we not want to write to
that element of the cell state

01:00:37.771 --> 01:00:38.687
at this time step

01:00:38.687 --> 01:00:41.719
in the case that i is zero.

01:00:41.719 --> 01:00:42.585
And now the gate gate,

01:00:42.585 --> 01:00:44.254
because it's coming through a tanh,

01:00:44.254 --> 01:00:46.031
will be either one or minus one.

01:00:46.031 --> 01:00:47.495
So that is the value that we want,

01:00:47.495 --> 01:00:50.500
the candidate value that
we might consider writing

01:00:50.500 --> 01:00:54.022
to each element of the cell
state at this time step.

01:00:54.022 --> 01:00:56.098
Then if you look at the
cell state equation,

01:00:56.098 --> 01:00:58.157
you can see that at every time step,

01:00:58.157 --> 01:00:59.963
the cell state has these kind of

01:00:59.963 --> 01:01:02.499
these different,
independent scaler values,

01:01:02.499 --> 01:01:05.230
and they're all being incremented
or decremented by one.

01:01:05.230 --> 01:01:07.466
So there's kind of like,

01:01:07.466 --> 01:01:09.560
inside the cell state,
we can either remember

01:01:09.560 --> 01:01:11.028
or forget our previous state,

01:01:11.028 --> 01:01:13.480
and then we can either
increment or decrement

01:01:13.480 --> 01:01:14.765
each element of that cell state

01:01:14.765 --> 01:01:16.535
by up to one at each time step.

01:01:16.535 --> 01:01:19.430
So you can kind of think of
these elements of the cell state

01:01:19.430 --> 01:01:22.189
as being little scaler integer counters

01:01:22.189 --> 01:01:24.061
that can be incremented and decremented

01:01:24.061 --> 01:01:25.708
at each time step.

01:01:25.708 --> 01:01:28.489
And now, after we've
computed our cell state,

01:01:28.489 --> 01:01:31.624
then we use our now updated cell state

01:01:31.624 --> 01:01:32.874
to compute a hidden state,

01:01:32.874 --> 01:01:36.513
which we will reveal to the outside world.

01:01:36.513 --> 01:01:38.832
So because this cell state
has this interpretation

01:01:38.832 --> 01:01:39.884
of being counters,

01:01:39.884 --> 01:01:41.280
and sort of counting up by one

01:01:41.280 --> 01:01:43.353
or minus one at each time step,

01:01:43.353 --> 01:01:45.518
we want to squash that counter value

01:01:45.518 --> 01:01:48.895
into a nice zero to
one range using a tanh.

01:01:48.895 --> 01:01:50.483
And now, we multiply element wise,

01:01:50.483 --> 01:01:51.826
by this output gate.

01:01:51.826 --> 01:01:54.441
And the output gate is again
coming through a sigmoid,

01:01:54.441 --> 01:01:57.597
so you can think of it as
being mostly zeros and ones,

01:01:57.597 --> 01:01:58.947
and the output gate tells us

01:01:58.947 --> 01:02:00.826
for each element of our cell state,

01:02:00.826 --> 01:02:02.773
do we want to reveal or not reveal

01:02:02.773 --> 01:02:04.614
that element of our cell state

01:02:04.614 --> 01:02:06.994
when we're computing the
external hidden state

01:02:06.994 --> 01:02:08.577
for this time step.

01:02:09.736 --> 01:02:11.609
And then, I think there's
kind of a tradition

01:02:11.609 --> 01:02:13.479
in people trying to explain LSTMs,

01:02:13.479 --> 01:02:14.524
that everyone needs to come up

01:02:14.524 --> 01:02:17.132
with their own potentially
confusing LSTM diagram.

01:02:17.132 --> 01:02:18.882
So here's my attempt.

01:02:20.380 --> 01:02:23.866
Here, we can see what's going
on inside this LSTM cell,

01:02:23.866 --> 01:02:24.865
is that we take our,

01:02:24.865 --> 01:02:27.566
we're taking as input on the
left our previous cell state

01:02:27.566 --> 01:02:28.937
and the previous hidden state,

01:02:28.937 --> 01:02:31.266
as well as our current input, x t.

01:02:31.266 --> 01:02:32.537
Now we're going to take our current,

01:02:32.537 --> 01:02:34.985
our previous hidden state,

01:02:34.985 --> 01:02:36.453
as well as our current input,

01:02:36.453 --> 01:02:37.346
stack them,

01:02:37.346 --> 01:02:39.526
and then multiply with
this weight matrix, w,

01:02:39.526 --> 01:02:41.166
to produce our four gates.

01:02:41.166 --> 01:02:42.627
And here, I've left
out the non linearities

01:02:42.627 --> 01:02:44.836
because we saw those on a previous slide.

01:02:44.836 --> 01:02:47.294
And now the forget gate
multiplies element wise

01:02:47.294 --> 01:02:48.143
with the cell state.

01:02:48.143 --> 01:02:51.174
The input and gate gate
are multiplied element wise

01:02:51.174 --> 01:02:52.689
and added to the cell state.

01:02:52.689 --> 01:02:54.524
And that gives us our next cell.

01:02:54.524 --> 01:02:56.533
The next cell gets
squashed through a tanh,

01:02:56.533 --> 01:02:58.616
and multiplied element
wise with this output gate

01:02:58.616 --> 01:03:01.366
to produce our next hidden state.

01:03:02.417 --> 01:03:03.250
Question?

01:03:13.116 --> 01:03:14.587
No, So they're coming through this,

01:03:14.587 --> 01:03:17.878
they're coming from different
parts of this weight matrix.

01:03:17.878 --> 01:03:19.147
So if our hidden,

01:03:19.147 --> 01:03:23.343
if our x and our h all
have this dimension h,

01:03:23.343 --> 01:03:24.300
then after we stack them,

01:03:24.300 --> 01:03:26.415
they'll be a vector size two h,

01:03:26.415 --> 01:03:28.718
and now our weight matrix
will be this matrix

01:03:28.718 --> 01:03:30.393
of size four h times two h.

01:03:30.393 --> 01:03:31.873
So you can think of that as sort of having

01:03:31.873 --> 01:03:34.076
four chunks of this weight matrix.

01:03:34.076 --> 01:03:37.344
And each of these four
chunks of the weight matrix

01:03:37.344 --> 01:03:41.511
is going to compute a
different one of these gates.

01:03:42.404 --> 01:03:44.673
You'll often see this written for clarity,

01:03:44.673 --> 01:03:46.449
kind of combining all
four of those different

01:03:46.449 --> 01:03:49.161
weight matrices into a
single large matrix, w,

01:03:49.161 --> 01:03:51.109
just for notational convenience.

01:03:51.109 --> 01:03:52.484
But they're all computed

01:03:52.484 --> 01:03:56.067
using different parts
of the weight matrix.

01:03:57.080 --> 01:03:58.645
But you're correct in
that they're all computed

01:03:58.645 --> 01:04:00.458
using the same functional form

01:04:00.458 --> 01:04:01.658
of just stacking the two things

01:04:01.658 --> 01:04:04.574
and taking the matrix multiplication.

01:04:04.574 --> 01:04:06.393
Now that we have this picture,

01:04:06.393 --> 01:04:09.519
we can think about what
happens to an LSTM cell

01:04:09.519 --> 01:04:11.196
during the backwards pass?

01:04:11.196 --> 01:04:12.795
We saw, in the context of vanilla

01:04:12.795 --> 01:04:13.738
recurrent neural network,

01:04:13.738 --> 01:04:15.803
that some bad things happened
during the backwards pass,

01:04:15.803 --> 01:04:16.958
where we were continually multiplying

01:04:16.958 --> 01:04:18.999
by that weight matrix, w.

01:04:18.999 --> 01:04:20.555
But now, the situation looks much,

01:04:20.555 --> 01:04:23.238
quite a bit different in the LSTM.

01:04:23.238 --> 01:04:26.468
If you imagine this path backwards

01:04:26.468 --> 01:04:28.323
of computing the gradients
of the cell state,

01:04:28.323 --> 01:04:29.952
we get quite a nice picture.

01:04:29.952 --> 01:04:32.266
Now, when we have our upstream gradient

01:04:32.266 --> 01:04:33.320
from the cell coming in,

01:04:33.320 --> 01:04:36.281
then once we backpropagate backwards

01:04:36.281 --> 01:04:37.737
through this addition operation,

01:04:37.737 --> 01:04:40.572
remember that this addition just copies

01:04:40.572 --> 01:04:43.663
that upstream gradient
into the two branches,

01:04:43.663 --> 01:04:45.641
so our upstream gradient
gets copied directly

01:04:45.641 --> 01:04:47.954
and passed directly to backpropagating

01:04:47.954 --> 01:04:50.435
through this element wise multiply.

01:04:50.435 --> 01:04:52.192
So then our upstream
gradient ends up getting

01:04:52.192 --> 01:04:56.455
multiplied element wise
by the forget gate.

01:04:56.455 --> 01:05:00.364
As we backpropagate backwards
through this cell state,

01:05:00.364 --> 01:05:01.397
the only thing that happens

01:05:01.397 --> 01:05:03.818
to our upstream cell state gradient

01:05:03.818 --> 01:05:05.935
is that it ends up getting
multiplied element wise

01:05:05.935 --> 01:05:07.171
by the forget gate.

01:05:07.171 --> 01:05:09.939
This is really a lot nicer

01:05:09.939 --> 01:05:12.640
than the vanilla RNN for two reasons.

01:05:12.640 --> 01:05:14.318
One is that this forget gate

01:05:14.318 --> 01:05:16.488
is now an element wise multiplication

01:05:16.488 --> 01:05:18.498
rather than a full matrix multiplication.

01:05:18.498 --> 01:05:19.923
So element wise multiplication

01:05:19.923 --> 01:05:23.205
is going to be a little bit nicer

01:05:23.205 --> 01:05:24.964
than full matrix multiplication.

01:05:24.964 --> 01:05:27.208
Second is that element wise multiplication

01:05:27.208 --> 01:05:29.710
will potentially be
multiplying by a different

01:05:29.710 --> 01:05:31.354
forget gate at every time step.

01:05:31.354 --> 01:05:33.087
So remember, in the vanilla RNN,

01:05:33.087 --> 01:05:35.638
we were continually multiplying
by that same weight matrix

01:05:35.638 --> 01:05:36.660
over and over again,

01:05:36.660 --> 01:05:38.305
which led very explicitly

01:05:38.305 --> 01:05:40.563
to these exploding or vanishing gradients.

01:05:40.563 --> 01:05:41.943
But now in the LSTM case,

01:05:41.943 --> 01:05:45.161
this forget gate can
vary from each time step.

01:05:45.161 --> 01:05:47.463
Now, it's much easier for the model

01:05:47.463 --> 01:05:49.560
to avoid these problems

01:05:49.560 --> 01:05:51.670
of exploding and vanishing gradients.

01:05:51.670 --> 01:05:53.377
Finally, because this forget gate

01:05:53.377 --> 01:05:54.902
is coming out from a sigmoid,

01:05:54.902 --> 01:05:56.178
this element wise multiply

01:05:56.178 --> 01:05:58.438
is guaranteed to be between zero and one,

01:05:58.438 --> 01:06:00.868
which again, leads to sort
of nicer numerical properties

01:06:00.868 --> 01:06:02.457
if you imagine multiplying by these things

01:06:02.457 --> 01:06:04.278
over and over again.

01:06:04.278 --> 01:06:07.063
Another thing to notice
is that in the context

01:06:07.063 --> 01:06:08.909
of the vanilla recurrent neural network,

01:06:08.909 --> 01:06:10.239
we saw that during the backward pass,

01:06:10.239 --> 01:06:13.146
our gradients were flowing
through also a tanh

01:06:13.146 --> 01:06:14.273
at every time step.

01:06:14.273 --> 01:06:15.940
But now, in an LSTM,

01:06:17.459 --> 01:06:18.792
our outputs are,

01:06:20.790 --> 01:06:22.452
in an LSTM, our hidden state is used

01:06:22.452 --> 01:06:24.110
to compute those outputs, y t,

01:06:24.110 --> 01:06:26.141
so now, each hidden state,

01:06:26.141 --> 01:06:29.229
if you imagine backpropagating
from the final hidden state

01:06:29.229 --> 01:06:31.290
back to the first cell state,

01:06:31.290 --> 01:06:33.024
then through that backward path,

01:06:33.024 --> 01:06:37.396
we only backpropagate through
a single tanh non linearity

01:06:37.396 --> 01:06:42.044
rather than through a separate
tanh at every time step.

01:06:42.044 --> 01:06:44.059
So kind of when you put
all these things together,

01:06:44.059 --> 01:06:45.927
you can see this backwards pass

01:06:45.927 --> 01:06:48.646
backpropagating through the cell state

01:06:48.646 --> 01:06:50.483
is kind of a gradient super highway

01:06:50.483 --> 01:06:53.205
that lets gradients pass
relatively unimpeded

01:06:53.205 --> 01:06:55.047
from the loss at the very end of the model

01:06:55.047 --> 01:06:56.765
all the way back to the initial cell state

01:06:56.765 --> 01:06:59.172
at the beginning of the model.

01:06:59.172 --> 01:07:00.922
Was there a question?

01:07:02.901 --> 01:07:04.693
Yeah, what about the
gradient in respect to w?

01:07:04.693 --> 01:07:06.792
'Cause that's ultimately the
thing that we care about.

01:07:06.792 --> 01:07:08.752
So, the gradient with respect to w

01:07:08.752 --> 01:07:10.252
will come through,

01:07:11.737 --> 01:07:12.570
at every time step,

01:07:12.570 --> 01:07:13.771
will take our current cell state

01:07:13.771 --> 01:07:15.059
as well as our current hidden state

01:07:15.059 --> 01:07:16.272
and that will give us an element,

01:07:16.272 --> 01:07:18.480
that will give us our local gradient on w

01:07:18.480 --> 01:07:19.848
for that time step.

01:07:19.848 --> 01:07:21.340
So because our cell state,

01:07:21.340 --> 01:07:23.791
and just in the vanilla RNN case,

01:07:23.791 --> 01:07:27.307
we'll end up adding those
first time step w gradients

01:07:27.307 --> 01:07:29.587
to compute our final gradient on w.

01:07:29.587 --> 01:07:33.139
But now, if you imagine the situation

01:07:33.139 --> 01:07:34.961
where we have a very long sequence,

01:07:34.961 --> 01:07:36.405
and we're only getting
gradients to the very end

01:07:36.405 --> 01:07:37.280
of the sequence.

01:07:37.280 --> 01:07:38.945
Now, as you backpropagate through,

01:07:38.945 --> 01:07:40.978
we'll get a local gradient on w

01:07:40.978 --> 01:07:43.219
for each time step,

01:07:43.219 --> 01:07:44.484
and that local gradient on w

01:07:44.484 --> 01:07:48.506
will be coming through
these gradients on c and h.

01:07:48.506 --> 01:07:50.994
So because we're maintaining
the gradients on c

01:07:50.994 --> 01:07:52.751
much more nicely in the LSTM case,

01:07:52.751 --> 01:07:54.988
those local gradients
on w at each time step

01:07:54.988 --> 01:07:57.221
will also be carried forward and backward

01:07:57.221 --> 01:07:59.804
through time much more cleanly.

01:08:01.627 --> 01:08:03.044
Another question?

01:08:17.428 --> 01:08:18.645
Yeah, so the question is

01:08:18.645 --> 01:08:19.886
due to the non linearities,

01:08:19.886 --> 01:08:22.088
could this still be susceptible
to vanishing gradients?

01:08:22.089 --> 01:08:24.077
And that could be the case.

01:08:24.077 --> 01:08:26.176
Actually, so one problem you might imagine

01:08:26.176 --> 01:08:27.560
is that maybe if these forget gates

01:08:27.560 --> 01:08:29.322
are always less than zero,

01:08:29.323 --> 01:08:30.252
or always less than one,

01:08:30.252 --> 01:08:31.411
you might get vanishing gradients

01:08:31.411 --> 01:08:34.103
as you continually go
through these forget gates.

01:08:34.103 --> 01:08:35.960
Well, one sort of trick
that people do in practice

01:08:35.960 --> 01:08:38.513
is that they will, sometimes,

01:08:38.513 --> 01:08:40.689
initialize the biases of the forget gate

01:08:40.689 --> 01:08:42.746
to be somewhat positive.

01:08:42.746 --> 01:08:44.004
So that at the beginning of training,

01:08:44.004 --> 01:08:46.305
those forget gates are
always very close to one.

01:08:46.305 --> 01:08:48.118
So that at least at the
beginning of training,

01:08:48.118 --> 01:08:52.285
then we have not so,
relatively clean gradient flow

01:08:53.265 --> 01:08:54.381
through these forget gates,

01:08:54.381 --> 01:08:56.631
since they're all
initialized to be near one.

01:08:56.631 --> 01:08:58.934
And then throughout
the course of training,

01:08:58.935 --> 01:09:00.308
then the model can learn those biases

01:09:00.308 --> 01:09:02.742
and kind of learn to
forget where it needs to.

01:09:02.742 --> 01:09:04.886
You're right that there
still could be some potential

01:09:04.886 --> 01:09:06.246
for vanishing gradients here.

01:09:06.246 --> 01:09:07.569
But it's much less extreme

01:09:07.569 --> 01:09:09.182
than the vanilla RNN case,

01:09:09.182 --> 01:09:12.126
both because those fs can
vary at each time step,

01:09:12.126 --> 01:09:14.590
and also because we're doing

01:09:14.590 --> 01:09:15.620
this element wise multiplication

01:09:15.620 --> 01:09:19.126
rather than a full matrix multiplication.

01:09:19.126 --> 01:09:20.801
So you can see that this LSTM

01:09:20.801 --> 01:09:23.048
actually looks quite similar to ResNet.

01:09:23.048 --> 01:09:24.510
In this residual network,

01:09:24.510 --> 01:09:26.732
we had this path of identity connections

01:09:26.732 --> 01:09:28.069
going backward through the network

01:09:28.069 --> 01:09:30.362
and that gave, sort of
a gradient super highway

01:09:30.363 --> 01:09:32.303
for gradients to flow backward in ResNet.

01:09:32.303 --> 01:09:34.924
And now it's kind of the
same intuition in LSTM

01:09:34.924 --> 01:09:37.944
where these additive and element wise

01:09:37.944 --> 01:09:40.226
multiplicative interactions
of the cell state

01:09:40.227 --> 01:09:42.305
can give a similar gradient super highway

01:09:42.305 --> 01:09:44.328
for gradients to flow backwards
through the cell state

01:09:44.328 --> 01:09:45.245
in an LSTM.

01:09:46.343 --> 01:09:48.593
And by the way, there's this
other kind of nice paper

01:09:48.593 --> 01:09:49.682
called highway networks,

01:09:49.682 --> 01:09:51.808
which is kind of in between this idea

01:09:51.808 --> 01:09:53.225
of this LSTM cell

01:09:54.138 --> 01:09:56.471
and these residual networks.

01:09:57.796 --> 01:09:58.697
So these highway networks

01:09:58.697 --> 01:10:01.165
actually came before residual networks,

01:10:01.165 --> 01:10:03.008
and they had this idea where

01:10:03.008 --> 01:10:04.541
at every layer of the highway network,

01:10:04.541 --> 01:10:05.984
we're going to compute

01:10:05.984 --> 01:10:07.373
sort of a candidate activation,

01:10:07.373 --> 01:10:08.804
as well as a gating function

01:10:08.804 --> 01:10:10.792
that tells us that interprelates

01:10:10.792 --> 01:10:13.045
between our previous input at that layer,

01:10:13.045 --> 01:10:14.676
and that candidate activation

01:10:14.676 --> 01:10:17.017
that came through our
convolutions or what not.

01:10:17.017 --> 01:10:19.455
So there's actually a lot of
architectural similarities

01:10:19.455 --> 01:10:20.472
between these things,

01:10:20.472 --> 01:10:21.821
and people take a lot of inspiration

01:10:21.821 --> 01:10:24.363
from training very deep CNNs

01:10:24.363 --> 01:10:25.196
and very deep RNNs

01:10:25.196 --> 01:10:27.688
and there's a lot of crossover here.

01:10:27.688 --> 01:10:31.682
Very briefly, you'll see a
lot of other types of variance

01:10:31.682 --> 01:10:33.777
of recurrent neural network
architectures out there

01:10:33.777 --> 01:10:34.675
in the wild.

01:10:34.675 --> 01:10:36.760
Probably the most common,
apart from the LSTM,

01:10:36.760 --> 01:10:40.853
is this GRU, called the
gated recurrent unit.

01:10:40.853 --> 01:10:42.665
And you can see those
update equations here,

01:10:42.665 --> 01:10:45.701
and it kind of has this
similar flavor of the LSTM,

01:10:45.701 --> 01:10:49.318
where it uses these
multiplicative element wise gates

01:10:49.318 --> 01:10:51.417
together with these additive interactions

01:10:51.417 --> 01:10:53.828
to avoid this vanishing gradient problem.

01:10:53.828 --> 01:10:55.584
There's also this cool paper

01:10:55.584 --> 01:10:57.679
called LSTM: a search based oddysey,

01:10:57.679 --> 01:10:59.734
very inventive title,

01:10:59.734 --> 01:11:02.853
where they tried to play
around with the LSTM equations

01:11:02.853 --> 01:11:04.980
and swap out the non
linearities at one point,

01:11:04.980 --> 01:11:06.329
like do we really need that tanh

01:11:06.329 --> 01:11:07.343
for exposing the output gate,

01:11:07.343 --> 01:11:09.756
and they tried to answer a lot
of these different questions

01:11:09.756 --> 01:11:11.367
about each of those non linearities,

01:11:11.367 --> 01:11:14.017
each of those pieces of
the LSTM update equations.

01:11:14.017 --> 01:11:15.374
What happens if we change the model

01:11:15.374 --> 01:11:18.068
and tweak those LSTM
equations a little bit.

01:11:18.068 --> 01:11:19.038
And kind of the conclusion is

01:11:19.038 --> 01:11:20.260
that they all work about the same

01:11:20.260 --> 01:11:22.414
Some of them work a little
bit better than others

01:11:22.414 --> 01:11:24.521
for one problem or another.

01:11:24.521 --> 01:11:25.735
But generally, none of the things,

01:11:25.735 --> 01:11:28.036
none of the tweaks of LSTM that they tried

01:11:28.036 --> 01:11:30.911
were significantly better
that the original LSTM

01:11:30.911 --> 01:11:32.148
for all problems.

01:11:32.148 --> 01:11:33.647
So that gives you a little bit more faith

01:11:33.647 --> 01:11:36.505
that the LSTM update
equations seem kind of magical

01:11:36.505 --> 01:11:38.889
but they're useful anyway.

01:11:38.889 --> 01:11:41.084
You should probably consider
them for your problem.

01:11:41.084 --> 01:11:43.692
There's also this cool paper
from Google a couple years ago

01:11:43.692 --> 01:11:44.575
where they tried to use,

01:11:44.575 --> 01:11:46.520
where they did kind of
an evolutionary search

01:11:46.520 --> 01:11:48.117
and did a search over many,

01:11:48.117 --> 01:11:52.605
over a very large number of
random RNN architectures,

01:11:52.605 --> 01:11:55.694
they kind of randomly premute
these update equations

01:11:55.694 --> 01:11:57.936
and try putting the additions
and the multiplications

01:11:57.936 --> 01:11:59.332
and the gates and the non linearities

01:11:59.332 --> 01:12:00.860
in different kinds of combinations.

01:12:00.860 --> 01:12:03.288
They blasted this out over
their huge Google cluster

01:12:03.288 --> 01:12:04.351
and just tried a whole bunch

01:12:04.351 --> 01:12:08.570
of these different weigh
updates in various flavors.

01:12:08.570 --> 01:12:09.742
And again, it was the same story

01:12:09.742 --> 01:12:11.299
that they didn't really find anything

01:12:11.299 --> 01:12:12.607
that was significantly better

01:12:12.607 --> 01:12:15.446
than these existing GRU or LSTM styles.

01:12:15.446 --> 01:12:16.978
Although there were some
variations that worked

01:12:16.978 --> 01:12:19.518
maybe slightly better or
worse for certain problems.

01:12:19.518 --> 01:12:21.304
But kind of the take away is that

01:12:21.304 --> 01:12:24.252
probably and using an LSTM or GRU

01:12:24.252 --> 01:12:27.080
is not so much magic in those equations,

01:12:27.080 --> 01:12:29.292
but this idea of managing
gradient flow properly

01:12:29.292 --> 01:12:30.633
through these additive connections

01:12:30.633 --> 01:12:32.023
and these multiplicative gates

01:12:32.023 --> 01:12:33.356
is super useful.

01:12:34.888 --> 01:12:37.286
So yeah, the summary is
that RNNs are super cool.

01:12:37.286 --> 01:12:40.103
They can allow you to attack
tons of new types of problems.

01:12:40.103 --> 01:12:42.024
They sometimes are
susceptible to vanishing

01:12:42.024 --> 01:12:43.431
or exploding gradients.

01:12:43.431 --> 01:12:44.740
But we can address that
with weight clipping

01:12:44.740 --> 01:12:47.412
and with fancier architectures.

01:12:47.412 --> 01:12:48.899
And there's a lot of cool overlap

01:12:48.899 --> 01:12:52.043
between CNN architectures
and RNN architectures.

01:12:52.043 --> 01:12:53.856
So next time, you'll
be taking the midterm.

01:12:53.856 --> 01:12:57.856
But after that, we'll
have a, sorry, a question?

01:12:59.525 --> 01:13:00.801
Midterm is after this lecture

01:13:00.801 --> 01:13:04.142
so anything up to this point is fair game.